DataCamp offers interactive courses related to R programming. While some of the material is review, it is helpful to see other perspectives on it. DataCamp also has some interesting materials on packages that I want to learn better (ggplot2, dplyr, ggvis, etc.). This document summarizes a few key insights from:
This document is currently split between _v003, _v003_a, and _v003_b to keep the number of DLLs it loads below R's hard-coded maximum. This introductory section needs to be rewritten, and the contents consolidated, at a future date.
The original DataCamp_Insights_v001 and DataCamp_Insights_v002 documents were split to produce this document:
Chapter 1 - Introduction
Problems in spatial statistics:
Simulation and testing with spatstat:
Further testing:
Example code includes:
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
# The number of points to create
n <- 200
# Set the range
xmin <- 0
xmax <- 1
ymin <- 0
ymax <- 2
# Sample from a Uniform distribution
x <- runif(n, xmin, xmax)
y <- runif(n, ymin, ymax)
# The ratio of the Y axis scale to the X axis scale is called the aspect ratio of the plot
# Spatial data should always be presented with an aspect ratio of 1:1
# See pre-defined variables
# ls.str()
# Plot points and a rectangle
mapxy <- function(a = NA){
plot(x, y, asp = a)
rect(xmin, ymin, xmax, ymax)
}
mapxy(1)
# How do we create a uniform density point pattern in a circle?
# We might first try selecting the radius and angle uniformly. But that produces a "cluster" of points near the center
# Instead we sample the radius from a non-uniform distribution whose density scales linearly with distance (by sampling the squared radius uniformly and taking the square root), so we have fewer points at small radii and more at large radii
# This exercise uses spatstat's disc() function, which creates a circular window.
# Load the spatstat package
library(spatstat)
## Loading required package: spatstat.data
## Loading required package: nlme
##
## Attaching package: 'nlme'
## The following object is masked from 'package:dplyr':
##
## collapse
## Loading required package: rpart
##
## spatstat 1.55-0 (nickname: 'Stunned Mullet')
## For an introduction to spatstat, type 'beginner'
##
## Note: R version 3.3.3 (2017-03-06) is more than 9 months old; we strongly recommend upgrading to the latest version
# Create this many points, in a circle of this radius
n_points <- 300
radius <- 10
# Generate uniform random numbers up to radius-squared
r_squared <- runif(n_points, 0, radius**2)
angle <- runif(n_points, 0, 2*pi)
# Take the square root of the values to get a uniform spatial distribution
x <- sqrt(r_squared) * cos(angle)
y <- sqrt(r_squared) * sin(angle)
plot(spatstat::disc(radius))
points(x, y)
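# As a contrast (an added sketch, not from the course code): sampling the radius
# itself uniformly produces the central "cluster" described above
r_naive <- runif(n_points, 0, radius)
plot(disc(radius), main = "Uniform radius - clustered center")
points(r_naive * cos(angle), r_naive * sin(angle))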
# Some variables have been pre-defined
# ls.str()
# Set coordinates and window
ppxy <- ppp(x = x, y = y, window = disc(radius))
# Test the point pattern
qt <- quadrat.test(ppxy)
## Warning: Some expected counts are small; chi^2 approximation may be
## inaccurate
# Inspect the results
plot(qt)
print(qt)
##
## Chi-squared test of CSR using quadrat counts
## Pearson X2 statistic
##
## data: ppxy
## X2 = 19.395, df = 24, p-value = 0.5388
## alternative hypothesis: two.sided
##
## Quadrats: 25 tiles (irregular windows)
# In the previous exercise you used a set of 300 events scattered uniformly within a circle
# If you repeated the generation of the events, you would still have 300 of them, but in different locations
# The dataset of exactly 300 points is from a Poisson point process conditioned on the total being 300
# The spatstat package can generate Poisson spatial processes that are not conditioned on the total with the rpoispp() function, given an intensity and a window
# Just as the random number generator functions in R start with an "r", most of the random point-pattern functions in spatstat start with an "r".
# The area() function of spatstat will compute the area of a window such as a disc
# Create a disc of radius 10
disc10 <- disc(10)
# Compute the rate as count divided by area
lambda <- 500 / area(disc10)
# Create a point pattern object
ppois <- rpoispp(lambda = lambda, win = disc10)
# Plot the Poisson point pattern
plot(ppois)
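# Unlike the fixed-total sample above, the count here is itself random - not
# conditioned on a total (an added check, not from the course code)
npoints(ppois)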
# The spatstat package also has functions for generating point patterns from other process models
# These generally fall into one of two classes: clustered processes, where points occur together more than under a uniform Poisson process,
# and regular (aka inhibitory) processes where points are more spaced apart than under a uniform intensity Poisson process
# Some process models can generate patterns on a continuum from clustered through uniform to regular depending on their parameters
# The quadrat.test() function can test against clustered or regular alternative hypotheses
# By default it tests against either of those, but this can be changed with the alternative parameter to create a one-sided test.
# A Thomas process is a clustered pattern where a number of "parent" points, uniformly distributed, create a number of "child" points in their neighborhood
# The child points themselves form the pattern. This is an attractive point pattern, and makes sense for modelling things like trees, since new trees will grow near the original tree
# Random Thomas point patterns can be generated using rThomas()
# This takes three numbers that determine the intensity and clustering of the points, and a window object.
# Conversely the points of a Strauss process cause a lowering in the probability of finding another point nearby
# The parameters of a Strauss process can be such that it is a "hard-core" process, where no two points can be closer than a set threshold
# Creating points from this process involves some clever simulation algorithms
# This is a repulsive point pattern, and makes sense for modelling things like territorial animals, since the other animals of that species will avoid the territory of a given animal
# Random Strauss point patterns can be generated using rStrauss()
# This takes three numbers that determine the intensity and "territory" of the points, and a window object
# Points generated by a Strauss process are sometimes called regularly spaced.
# Create a disc of radius 10
disc10 <- disc(10)
# Generate clustered points from a Thomas process
set.seed(123)
p_cluster <- rThomas(kappa = 0.35, scale = 1, mu = 3, win = disc10)
plot(p_cluster)
# Run a quadrat test
quadrat.test(p_cluster, alternative = "clustered")
## Warning: Some expected counts are small; chi^2 approximation may be
## inaccurate
##
## Chi-squared test of CSR using quadrat counts
## Pearson X2 statistic
##
## data: p_cluster
## X2 = 53.387, df = 24, p-value = 0.0005142
## alternative hypothesis: clustered
##
## Quadrats: 25 tiles (irregular windows)
# Regular points from a Strauss process
set.seed(123)
p_regular <- rStrauss(beta = 2.9, gamma = 0.025, R = .5, W = disc10)
## Warning: Simulation will be performed in the containing rectangle and
## clipped to the original window.
plot(p_regular)
# Run a quadrat test
quadrat.test(p_regular, alternative = "regular")
## Warning: Some expected counts are small; chi^2 approximation may be
## inaccurate
##
## Chi-squared test of CSR using quadrat counts
## Pearson X2 statistic
##
## data: p_regular
## X2 = 8.57, df = 24, p-value = 0.001619
## alternative hypothesis: regular
##
## Quadrats: 25 tiles (irregular windows)
# Another way of assessing clustering and regularity is to consider each point, and how it relates to the other points
# One simple measure is the distribution of the distances from each point to its nearest neighbor
# The nndist() function in spatstat takes a point pattern and for each point returns the distance to its nearest neighbor
# Instead of working with the nearest-neighbor density, as seen in the histogram, it can be easier to work with the cumulative distribution function, G(r)
# This is the probability of a point having a nearest neighbor within a distance r
# For a uniform Poisson process, G can be computed theoretically, and is G(r) = 1 - exp( - lambda * pi * r ^ 2)
# You can compute G empirically from your data using Gest() and so compare with the theoretical value.
# Events near the edge of the window might have had their true nearest neighbor outside the window, and so unobserved
# This makes the distance to the observed nearest neighbor larger than expected, biasing the estimate of G
# There are several methods for correcting this bias
# Plotting the output from Gest shows the theoretical cumulative distribution and several estimates of the cumulative distribution using different edge corrections
# Often these edge corrections are almost indistinguishable, and the lines overlap
# The plot can be used as a quick exploratory test of complete spatial randomness
# Two ppp objects, p_poisson, and p_regular are defined for you
# Point patterns are pre-defined
p_poisson <- ppois
p_poisson
## Planar point pattern: 513 points
## window: polygonal boundary
## enclosing rectangle: [-10, 10] x [-10, 10] units
p_regular
## Planar point pattern: 325 points
## window: polygonal boundary
## enclosing rectangle: [-10, 10] x [-10, 10] units
# Calc nearest-neighbor distances for Poisson point data
nnd_poisson <- nndist(p_poisson)
# Draw a histogram of nearest-neighbor distances
hist(nnd_poisson)
# Estimate G(r)
G_poisson <- Gest(p_poisson)
# Plot G(r) vs. r
plot(G_poisson)
# Repeat for regular point data
nnd_regular <- nndist(p_regular)
hist(nnd_regular)
G_regular <- Gest(p_regular)
plot(G_regular)
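# As an added cross-check (not part of the course code), the theoretical
# G(r) = 1 - exp( - lambda * pi * r ^ 2) for the Poisson pattern can be computed
# by hand from the estimated intensity and compared with the empirical CDF of
# the nearest-neighbor distances
lambda_hat <- intensity(p_poisson)
r <- seq(0, max(nnd_poisson), length.out = 100)
plot(ecdf(nnd_poisson), main = "Empirical vs. theoretical G(r), Poisson pattern")
lines(r, 1 - exp(-lambda_hat * pi * r^2), col = "red")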
# A number of other functions of point patterns have been developed
# They are conventionally denoted by various capital letters, including F, H, J, K and L
# The K-function is defined as the expected number of points within a distance r of a typical point of the process, scaled by the intensity
# Like G, this can be computed theoretically for a uniform Poisson process: K(r) = pi * r ^ 2, the area of a circle of that radius
# Deviation from pi * r ^ 2 can indicate clustering or point inhibition
# Computational estimates of K(r) are done using the Kest() function.
# As with G calculations, K-function calculations also need edge corrections
# The default edge correction in spatstat is generally the best, but can be slow, so we'll use the "border" correction for speed here
# Uncertainties on K-function estimates can be assessed by randomly sampling points from a uniform Poisson process in the area and computing the K-function of the simulated data
# Repeat this process 99 times, and take the minimum and maximum value of K over each of the distance values
# This gives an envelope - if the K-function from the data goes above the top of the envelope then we have evidence for clustering
# If the K-function goes below the envelope then there is evidence for an inhibitory process causing points to be spaced out
# Envelopes can be computed using the envelope() function
# The plot method for estimates of K uses a formula system where a dot on the left of a formula refers to K(r)
# So the default plot uses . ~ r
# You can compare the estimate of K to a Poisson process by plotting . - pi * r ^ 2 ~ r
# If the data was generated by a Poisson process, then the line should be close to zero for all values of r
# Point patterns are pre-defined
p_poisson
## Planar point pattern: 513 points
## window: polygonal boundary
## enclosing rectangle: [-10, 10] x [-10, 10] units
p_cluster
## Planar point pattern: 332 points
## window: polygonal boundary
## enclosing rectangle: [-10, 10] x [-10, 10] units
p_regular
## Planar point pattern: 325 points
## window: polygonal boundary
## enclosing rectangle: [-10, 10] x [-10, 10] units
# Estimate the K-function for the Poisson points
K_poisson <- Kest(p_poisson, correction = "border")
# The default plot shows quadratic growth
plot(K_poisson, . ~ r)
# Subtract pi * r ^ 2 from the Y-axis and plot
plot(K_poisson, . - pi * r**2 ~ r)
# Compute envelopes of K under random locations
K_cluster_env <- envelope(p_cluster, Kest, correction = "border")
## Generating 99 simulations of CSR ...
## 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38,
## 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76,
## 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99.
##
## Done.
# Insert the full formula to plot K minus pi * r^2
plot(K_cluster_env, . - pi * r^2 ~ r)
# Repeat for regular data
K_regular_env <- envelope(p_regular, Kest, correction = "border")
## Generating 99 simulations of CSR ...
## 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38,
## 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76,
## 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99.
##
## Done.
plot(K_regular_env, . - pi * r^2 ~ r)
Chapter 2 - Point Pattern Analysis
Bivariate point problems:
Spatial segregation:
Space-time data:
Space-time clustering:
Example code includes:
# The dataset we shall use for this example consists of crimes in a 4km radius of the center of Preston, a town in north-west England
# We want to look for hotspots of violent crime in the area
# A ppp object called preston_crime has been constructed
# This is a marked point process, where each point is marked as either "Violent crime" or "Non-violent crime"
# The marks for each point can be retrieved using the marks() function
# The window is a 4km circle centered on the town center
# A map image of the town from OpenStreetMap has also been loaded, called preston_osm
preston_crime <- readRDS("./RInputFiles/pcrime-spatstat.RDS")
preston_osm <- readRDS("./RInputFiles/osm_preston_gray.RDS")
# Get some summary information on the dataset
summary(preston_crime)
## Marked planar point pattern: 2036 points
## Average intensity 4.053214e-05 points per square unit
##
## Coordinates are given to 2 decimal places
## i.e. rounded to the nearest multiple of 0.01 units
##
## Multitype:
## frequency proportion intensity
## Non-violent crime 1812 0.8899804 3.607281e-05
## Violent crime 224 0.1100196 4.459332e-06
##
## Window: polygonal boundary
## single connected closed polygon with 99 vertices
## enclosing rectangle: [349773, 357771] x [425706.5, 433705.5] units
## Window area = 50231700 square units
## Fraction of frame area: 0.785
# Get a table of marks
table(marks(preston_crime))
##
## Non-violent crime Violent crime
## 1812 224
# Define a function to create a map
preston_map <- function(cols = c("green","red"), cex = c(1, 1), pch = c(1, 1)) {
raster::plotRGB(preston_osm) # from the raster package
plot(preston_crime, cols = cols, pch = pch, cex = cex, add = TRUE, show.window = TRUE)
}
# Draw the map with colors, sizes and plot character
preston_map(
cols = c("black", "red"),
cex = c(0.5, 1),
pch = 19
)
# One method of computing a smooth intensity surface from a set of points is to use kernel smoothing
# Imagine replacing each point with a dot of ink on absorbent paper
# Each individual ink drop spreads out into a patch with a dark center, and multiple drops add together and make the paper even darker
# With the right amount of ink in each drop, and with paper of the right absorbency, you can create a fair impression of the density of the original points
# In kernel smoothing jargon, this means computing a bandwidth and using a particular kernel function
# To get a smooth map of violent crimes proportion, we can estimate the intensity surface for violent and non-violent crimes, and take the ratio
# To do this with the density() function in spatstat, we have to split the points according to the two values of the marks and then compute the ratio of the violent crime surface to the total
# The function has sensible defaults for the kernel function and bandwidth to guarantee something that looks at least plausible
# preston_crime has been pre-defined
preston_crime
## Marked planar point pattern: 2036 points
## Multitype, with levels = Non-violent crime, Violent crime
## window: polygonal boundary
## enclosing rectangle: [349773, 357771] x [425706.5, 433705.5] units
# Use the split function to show the two point patterns
crime_splits <- split(preston_crime)
# Plot the split crime
plot(crime_splits)
# Compute the densities of both sets of points
crime_densities <- density(crime_splits)
# Calc the violent density divided by the sum of both
frac_violent_crime_density <- crime_densities[[2]] /
(crime_densities[[1]] + crime_densities[[2]])
# Plot the density of the fraction of violent crime
plot(frac_violent_crime_density)
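# The default bandwidth can also be set explicitly (an added aside, not from the
# course code): spatstat's bw.diggle() is one cross-validation based selector,
# passed to density() through the sigma argument
sigma_hat <- bw.diggle(unmark(preston_crime))
plot(density(unmark(preston_crime), sigma = sigma_hat))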
# We can get a more principled measure of the violent crime ratio using a spatial segregation model
# The spatialkernel package implements the theory of spatial segregation
# The first step is to compute the optimal bandwidth for kernel smoothing under the segregation model
# A small bandwidth would result in a density that is mostly zero, with spikes at the event locations
# A large bandwidth would flatten out any structure in the events, resulting in a large "blob" across the whole window
# Somewhere between these extremes is a bandwidth that best represents an underlying density for the process
# spseg() will scan over a range of bandwidths and compute a test statistic using a cross-validation method
# The bandwidth that maximizes this test statistic is the one to use
# The returned value from spseg() in this case is a list, with h and cv elements giving the values of the statistic over the input h values
# The spatialkernel package supplies a plotcv function to show how the test value varies
# The hcv element has the value of the best bandwidth
# spatstat is loaded and the preston_crime object is read in
# Scan from 500m to 1000m in steps of 50m
bw_choice <- spatialkernel::spseg(
preston_crime,
h = seq(500, 1000, by = 50),
opt = 1)
##
## Calculating cross-validated likelihood function
# Plot the results and highlight the best bandwidth
spatialkernel::plotcv(bw_choice)
abline(v = bw_choice$hcv, lty = 2, col = "red")
# Print the best bandwidth
print(bw_choice$hcv)
## [1] 800
# The second step is to compute the probabilities for violent and non-violent crimes as a smooth surface, as well as the p-values for a point-wise test of segregation
# This is done by calling spseg() with opt = 3 and a fixed bandwidth parameter h
# Normally you would run this process for at least 100 simulations, but that will take too long to run here
# Instead, run for only 10 simulations
# Then you can use a pre-loaded object seg which is the output from a 1000 simulation run that took about 20 minutes to complete
# Set the correct bandwidth and run for 10 simulations only
seg10 <- spatialkernel::spseg(
pts = preston_crime,
h = bw_choice$hcv,
opt = 3,
ntest = 10,
proc = FALSE)
# Plot the segregation map for violent crime
spatialkernel::plotmc(seg10, "Violent crime")
# Plot seg, the result of running 1000 simulations (not included here)
# spatialkernel::plotmc(seg, "Violent crime")
# With a base map and some image and contour functions we can display both the probabilities and the significance tests over the area with more control than the plotmc() function.
# The seg object is a list with several components
# The X and Y coordinates of the grid are stored in the $gridx and $gridy elements
# The probabilities of each class of data (violent or non-violent crime) are in a matrix element $p with a column for each class
# The p-value of the significance test is in a similar matrix element called $stpvalue
# Rearranging columns of these matrices into a grid of values can be done with R's matrix() function
# From there you can construct list objects with a vector $x of X-coordinates, $y of Y-coordinates, and $z as the matrix
# You can then feed this to image() or contour() for visualization
# This process may seem complex, but remember that with R you can always write functions to perform complex tasks that you may repeat often
# For example, to help with the mapping in this exercise you will create a function that builds a map from four different items
# The seg object from 1000 simulations is loaded, as well as the preston_crime points and the preston_osm map image
# Inspect the structure of the spatial segregation object
# str(seg)
# Get the number of columns in the data so we can rearrange to a grid
# ncol <- length(seg$gridx)
# Rearrange the probability column into a grid
# prob_violent <- list(x = seg$gridx,
# y = seg$gridy,
# z = matrix(seg$p[, "Violent crime"],
# ncol = ncol))
# image(prob_violent)
# Rearrange the p-values, but choose a p-value threshold
# p_value <- list(x = seg$gridx,
# y = seg$gridy,
# z = matrix(seg$stpvalue[, "Violent crime"] < 0.05,
# ncol = ncol))
# image(p_value)
# Create a mapping function
# segmap <- function(prob_list, pv_list, low, high){
#
# # background map
# plotRGB(preston_osm)
#
# # p-value areas
# image(pv_list,
# col = c("#00000000", "#FF808080"), add = TRUE)
#
# # probability contours
# contour(prob_list,
# levels = c(low, high),
# col = c("#206020", "red"),
# labels = c("Low", "High"),
# add = TRUE)
#
# # boundary window
# plot(Window(preston_crime), add = TRUE)
# }
#
# # Map the probability and p-value
# segmap(prob_violent, p_value, 0.05, 0.15)
# The sasquatch, or "bigfoot", is a large ape-like creature reported to live in North American forests
# The Bigfoot Field Researchers Organization maintains a database of sightings and allows its use for teaching and research
# A cleaned subset of data in north-west USA has been created as the ppp object sasq and is loaded for you to explore the space-time pattern of sightings in the area
# Get a quick summary of the dataset
sasq <- readRDS("./RInputFiles/sasquatch.RDS")
summary(sasq)
## Marked planar point pattern: 423 points
## Average intensity 2.097156e-09 points per square unit
##
## *Pattern contains duplicated points*
##
## Coordinates are given to 1 decimal place
## i.e. rounded to the nearest multiple of 0.1 units
##
## Mark variables: date, year, month
## Summary:
## date year month
## Min. :1990-05-03 Y2004 : 41 Sep : 59
## 1st Qu.:2000-04-30 Y2000 : 36 Oct : 56
## Median :2003-11-05 Y2002 : 30 Aug : 54
## Mean :2003-08-11 Y2005 : 30 Jul : 50
## 3rd Qu.:2007-11-02 Y2001 : 26 Nov : 43
## Max. :2016-04-05 Y2008 : 26 Jun : 41
## (Other):234 (Other):120
##
## Window: polygonal boundary
## single connected closed polygon with 64 vertices
## enclosing rectangle: [368187.8, 764535.6] x [4644873, 5434933] units
## Window area = 2.01702e+11 square units
## Fraction of frame area: 0.644
# Plot unmarked points
plot(unmark(sasq))
# Plot the points using a circle sized by date
plot(sasq, which.marks = "date")
# Show the available marks
names(marks(sasq))
## [1] "date" "year" "month"
# Histogram the dates of the sightings, grouped by year
hist(marks(sasq)$date, "years", freq = TRUE)
# Plot and tabulate the calendar month of all the sightings
plot(table(marks(sasq)$month))
# Split on the month mark
sasq_by_month <- split(sasq, "month", un = TRUE)
# Plot monthly maps
plot(sasq_by_month)
# Plot smoothed versions of the above split maps
plot(density(sasq_by_month))
# To do a space-time clustering test with stmctest() from the splancs package, you first need to convert parts of your ppp object
# Functions in splancs tend to use matrix data instead of data frames.
# To run stmctest() you need to set up
# event locations
# event times
# region polygon
# time limits
# the time and space ranges for analysis
# The sasq object is loaded and the spatstat and splancs packages are ready for use
# Get a matrix of event coordinates
sasq_xy <- as.matrix(spatstat::coords(sasq))
# Check the matrix has two columns
dim(sasq_xy)
## [1] 423 2
# Get a vector of event times
sasq_t <- marks(sasq)$date
# Extract a two-column matrix from the ppp object
sasq_poly <- as.matrix(as.data.frame(Window(sasq)))
dim(sasq_poly)
## [1] 64 2
# Set the time limit to 1 day before and 1 day after the range of times
tlimits <- range(sasq_t) + c(-1, 1)
# Scan over 400m intervals from 100m to 20km
s <- seq(100, 20000, by = 400)
# Scan over 14 day intervals from one week to 31 weeks
tm <- seq(7, 217, by = 14)
# Everything is now ready for you to run the space-time clustering test function
# You can then plot the results and compute a p-value for rejecting the null hypothesis of no space-time clustering
# Any space-time clustering in a data set will be removed if you randomly rearrange the dates of the data points
# The stmctest() function computes a clustering test statistic for your data based on the space-time K-function - how many points are within a spatial and temporal window of a point of the data
# It then does a number of random rearrangements of the dates among the points and computes the clustering statistic
# After doing this a large number of times, you can compare the test statistic for your data with the values from the random data
# If the test statistic for your data is sufficiently large or small, you can reject the null hypothesis of no space-time clustering
# The output from stmctest() is a list with a single t0 which is the test statistic for your data, and a vector of t from the simulations
# By converting to data frame you can feed this to ggplot functions
# Because the window area is a large number of square meters, and we have about 400 events, the numerical value of the intensity is a very small number
# This makes values of the various K-functions very large numbers, since they are proportional to the inverse of the intensity
# Don't worry if you see 10^10 or higher
# The p-value of a Monte-Carlo test like this is just the proportion of test statistics that are larger than the value from the data
# You can compute this from the t and t0 elements of the output
# All the objects from the previous exercise are loaded.
# Run 999 simulations
sasq_mc <- splancs::stmctest(sasq_xy, sasq_t, sasq_poly, tlimits, s, tm, nsim = 999, quiet = TRUE)
names(sasq_mc)
## [1] "t0" "t"
# Histogram the simulated statistics and add a line at the data value
ggplot(data.frame(sasq_mc), aes(x = t)) +
geom_histogram(binwidth = 1e13) +
geom_vline(aes(xintercept = t0))
# Compute the p-value as the proportion of tests greater than the data
sum(sasq_mc$t > sasq_mc$t0) / 1000
## [1] 0.04
Chapter 3 - Areal Statistics
Areal statistics:
Spatial health data:
Generalized linear models in space:
Correlation in spatial GLM:
Example code includes:
library(cartogram)
library(rgeos)
## rgeos version: 0.3-26, (SVN revision 560)
## GEOS runtime version: 3.6.1-CAPI-1.10.1 r0
## Linking to sp version: 1.2-7
## Polygon checking: TRUE
library(spdep)
## Loading required package: sp
## Loading required package: Matrix
## Loading required package: spData
## To access larger datasets in this package, install the spDataLarge
## package with: `install.packages('spDataLarge')`
##
## Attaching package: 'spData'
## The following objects are masked _by_ '.GlobalEnv':
##
## x, y
library(epitools)
library(R2BayesX)
## Loading required package: BayesXsrc
## Loading required package: colorspace
##
## Attaching package: 'colorspace'
## The following object is masked from 'package:spatstat':
##
## coords
## Loading required package: mgcv
## This is mgcv 1.8-17. For overview type 'help("mgcv-package")'.
# In 2016 the UK held a public vote on whether to remain in the European Union
# The results of the referendum, where people voted either "Remain" or "Leave", are available online
# The data set london_ref contains the results for the 32 boroughs of London, and includes the number and percentage of votes in each category as well as the count of spoilt votes, the population size and the electorate size
# The london_ref object is a SpatialPolygonsDataFrame, a special kind of data frame where each row also has the shape of the borough
# It behaves like a data frame in many respects, but can also be used to plot a choropleth, or shaded polygon, map
# You should start with some simple data exploration and mapping. The following variables will be useful:
# NAME : the name of the borough.
# Electorate : the total number of people who can vote.
# Remain, Leave : the number of votes for "Remain" or "Leave".
# Pct_Remain, Pct_Leave : the percentage of votes for each side
# spplot() from the sp package provides a convenient way to draw a shaded map of regions
# See what information we have for each borough
london_ref <- readRDS("./RInputFiles/london_eu.RDS")
summary(london_ref)
## Object of class SpatialPolygonsDataFrame
## Coordinates:
## min max
## x 503574.2 561956.7
## y 155850.8 200933.6
## Is projected: TRUE
## proj4string :
## [+proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000
## +y_0=-100000 +datum=OSGB36 +units=m +no_defs +ellps=airy
## +towgs84=446.448,-125.157,542.060,0.1502,0.2470,0.8421,-20.4894]
## Data attributes:
## NAME TOTAL_POP Electorate Votes_Cast
## Length:32 Min. :157711 Min. : 83042 Min. : 54801
## Class :character 1st Qu.:237717 1st Qu.:143458 1st Qu.:104079
## Mode :character Median :272017 Median :168394 Median :116280
## Mean :270780 Mean :169337 Mean :118025
## 3rd Qu.:316911 3rd Qu.:196285 3rd Qu.:134142
## Max. :379691 Max. :245349 Max. :182570
## Remain Leave Rejected_Ballots Pct_Remain
## Min. : 27750 Min. :17138 Min. : 60.0 Min. :30.34
## 1st Qu.: 55973 1st Qu.:32138 1st Qu.:105.0 1st Qu.:53.69
## Median : 70254 Median :45263 Median :138.0 Median :61.01
## Mean : 70631 Mean :47255 Mean :139.0 Mean :60.46
## 3rd Qu.: 84287 3rd Qu.:59018 3rd Qu.:164.2 3rd Qu.:69.90
## Max. :118463 Max. :96885 Max. :267.0 Max. :78.62
## Pct_Leave Pct_Rejected Assembly
## Min. :21.38 Min. :0.0600 Length:32
## 1st Qu.:30.10 1st Qu.:0.0875 Class :character
## Median :38.99 Median :0.1100 Mode :character
## Mean :39.54 Mean :0.1187
## 3rd Qu.:46.31 3rd Qu.:0.1500
## Max. :69.66 Max. :0.2200
# Which boroughs voted to "Leave"?
london_ref$NAME[london_ref$Leave > london_ref$Remain]
## [1] "Sutton" "Barking and Dagenham" "Bexley"
## [4] "Havering" "Hillingdon"
# Plot a map of the percentage that voted "Remain"
sp::spplot(london_ref, zcol = "Pct_Remain")
# Large areas, such as cities or countries, are often divided into smaller administrative units, frequently into zones of approximately equal population
# But the area of those units may vary considerably
# When mapping them, the large areas carry more visual "weight" than small areas, although just as many people live in the small areas.
# One technique for correcting for this is the cartogram
# This is a controlled distortion of the regions, expanding some and contracting others, so that the area of each region is proportional to a desired quantity, such as the population
# The cartogram also tries to maintain the correct geography as much as possible, by keeping regions in roughly the same place relative to each other
# The cartogram package contains functions for creating cartograms
# You give it a spatial data frame and the name of a column, and you get back a similar data frame but with regions distorted so that the region area is proportional to the column value of the regions
# You'll also use the rgeos package for computing the areas of individual regions with the gArea() function
# Use the cartogram and rgeos packages (called at top of routine)
# library(cartogram)
# library(rgeos)
# Make a scatterplot of electorate vs borough area
names(london_ref)
## [1] "NAME" "TOTAL_POP" "Electorate"
## [4] "Votes_Cast" "Remain" "Leave"
## [7] "Rejected_Ballots" "Pct_Remain" "Pct_Leave"
## [10] "Pct_Rejected" "Assembly"
plot(london_ref$Electorate, gArea(london_ref, byid = TRUE))
# Make a cartogram, scaling the area to the electorate
carto_ref <- cartogram(london_ref, "Electorate")
## Mean size error for iteration 1: 1.5881743190908
## Mean size error for iteration 2: 1.32100446455657
## Mean size error for iteration 3: 1.18227723476121
## Mean size error for iteration 4: 1.10676057030171
## Mean size error for iteration 5: 1.0657703107641
## Mean size error for iteration 6: 1.04259259672006
## Mean size error for iteration 7: 1.02832326230708
## Mean size error for iteration 8: 1.01931941526112
## Mean size error for iteration 9: 1.01341424685212
## Mean size error for iteration 10: 1.00941370606418
## Mean size error for iteration 11: 1.00663364742297
## Mean size error for iteration 12: 1.00470553629914
## Mean size error for iteration 13: 1.00336434720465
## Mean size error for iteration 14: 1.00241457265516
## Mean size error for iteration 15: 1.00174179254187
plot(carto_ref)
# Check the linearity of the electorate-area plot
plot(carto_ref$Electorate, gArea(carto_ref, byid = TRUE))
# Make a fairer map of the Remain percentage
sp::spplot(carto_ref, "Pct_Remain")
# The map of "Remain" votes seems to have spatial correlation
# Pick any two boroughs that are neighbors - with a shared border - and the chances are they'll be more similar than any two random boroughs
# This can be a problem when using statistical models that assume, conditional on the model, that the data points are independent
# The spdep package has functions for measures of spatial correlation, also known as spatial dependency
# Computing these measures first requires you to work out which regions are neighbors via the poly2nb() function, short for "polygons to neighbors"
# The result is an object of class nb
# Then you can compute the test statistic and run a significance test on the null hypothesis of no spatial correlation
# The significance test can either be done by Monte-Carlo or theoretical models
# In this example you'll use the Moran "I" statistic to test the spatial correlation of the population and the percentage "Remain" vote.
# The london_ref spatial data object is loaded for you
# Use the spdep package (called at top of routine)
# library(spdep)
# Make neighbor list
borough_nb <- poly2nb(london_ref)
# Get center points of each borough
borough_centers <- coordinates(london_ref)
# Show the connections
plot(london_ref)
plot(borough_nb, borough_centers, add = TRUE)
# Map the total pop'n
sp::spplot(london_ref, zcol = "TOTAL_POP")
# Run a Moran I test on total pop'n
moran.test(
london_ref$TOTAL_POP,
nb2listw(borough_nb)
)
##
## Moran I test under randomisation
##
## data: london_ref$TOTAL_POP
## weights: nb2listw(borough_nb)
##
## Moran I statistic standard deviate = 1.2124, p-value = 0.1127
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.11549264 -0.03225806 0.01485190
# Map % Remain
sp::spplot(london_ref, zcol = "Pct_Remain")
# Run a Moran I MC test on % Remain
moran.mc(
london_ref$Pct_Remain,
nb2listw(borough_nb),
nsim = 999
)
##
## Monte-Carlo simulation of Moran I
##
## data: london_ref$Pct_Remain
## weights: nb2listw(borough_nb)
## number of simulations + 1: 1000
##
## statistic = 0.42841, observed rank = 1000, p-value = 0.001
## alternative hypothesis: greater
# The UK's National Health Service publishes weekly data for consultations at a number of "sentinel" clinics and makes this data available
# A dataset for one week in February 2017 has been loaded for you to analyze
# It is called london, and contains data for the 32 boroughs.
# You will focus on reports of "Influenza-like illness", or more simply "Flu"
# Your first task is to map the "Standardized Morbidity Ratio", or SMR
# This is the number of cases per person, but scaled by the overall incidence so that the expected number is 1
# The london object, a spatial data frame, and the sp package are ready for you
# Get a summary of the data set
london <- readRDS("./RInputFiles/london_2017_2.RDS")
summary(london)
## Object of class SpatialPolygonsDataFrame
## Coordinates:
## min max
## x 503574.2 561956.7
## y 155850.8 200933.6
## Is projected: TRUE
## proj4string :
## [+proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000
## +y_0=-100000 +datum=OSGB36 +units=m +no_defs +ellps=airy
## +towgs84=446.448,-125.157,542.060,0.1502,0.2470,0.8421,-20.4894]
## Data attributes:
## CODE NAME Flu_OBS Vom_OBS
## Length:32 Length:32 Min. : 0.00 Min. : 0.00
## Class :character Class :character 1st Qu.: 11.00 1st Qu.:14.00
## Mode :character Mode :character Median : 33.00 Median :40.00
## Mean : 38.56 Mean :37.28
## 3rd Qu.: 61.00 3rd Qu.:57.50
## Max. :112.00 Max. :96.00
## Diarr_OBS Gastro_OBS TOTAL_POP CCGcode
## Min. : 0.00 Min. : 0.0 Min. :157711 Length:32
## 1st Qu.: 22.50 1st Qu.: 48.0 1st Qu.:237717 Class :character
## Median : 62.00 Median :120.5 Median :272017 Mode :character
## Mean : 57.03 Mean :113.7 Mean :270780
## 3rd Qu.: 90.75 3rd Qu.:176.8 3rd Qu.:316911
## Max. :122.00 Max. :251.0 Max. :379691
## CCG.geography.code CCG.name Asthma_Prevalence
## Length:32 Length:32 Min. :3.550
## Class :character Class :character 1st Qu.:4.412
## Mode :character Mode :character Median :4.660
## Mean :4.624
## 3rd Qu.:4.925
## Max. :5.470
## Obesity_Prevalence Cancer_Prevalence Diabetes_Prevalence Income
## Min. : 3.930 Min. :0.870 Min. :3.620 Min. :0.0730
## 1st Qu.: 5.855 1st Qu.:1.438 1st Qu.:5.265 1st Qu.:0.1308
## Median : 7.565 Median :1.605 Median :6.305 Median :0.1665
## Mean : 7.585 Mean :1.684 Mean :6.245 Mean :0.1655
## 3rd Qu.: 8.810 3rd Qu.:1.903 3rd Qu.:7.067 3rd Qu.:0.1985
## Max. :12.130 Max. :2.540 Max. :9.060 Max. :0.2530
## Employment Education HealthDeprivation Crime
## Min. :0.0570 Min. : 3.958 Min. :-1.4100 Min. :-0.1550
## 1st Qu.:0.0905 1st Qu.:10.047 1st Qu.:-0.5055 1st Qu.: 0.3745
## Median :0.1095 Median :13.925 Median :-0.2050 Median : 0.5515
## Mean :0.1092 Mean :14.024 Mean :-0.2044 Mean : 0.5379
## 3rd Qu.:0.1283 3rd Qu.:17.480 3rd Qu.: 0.2010 3rd Qu.: 0.7823
## Max. :0.1560 Max. :27.182 Max. : 0.5430 Max. : 1.0190
## Services Environment i
## Min. :19.63 Min. :13.37 Min. : 0.00
## 1st Qu.:24.43 1st Qu.:24.03 1st Qu.: 7.75
## Median :30.41 Median :28.20 Median :15.50
## Mean :29.55 Mean :31.38 Mean :15.50
## 3rd Qu.:34.74 3rd Qu.:40.15 3rd Qu.:23.25
## Max. :41.89 Max. :55.00 Max. :31.00
# Map the OBServed number of flu reports
sp::spplot(london, "Flu_OBS")
# Compute and print the overall incidence of flu
r <- sum(london$Flu_OBS) / sum(london$TOTAL_POP)
r
## [1] 0.0001424128
# Calculate the expected number for each borough
london$Flu_EXP <- london$TOTAL_POP * r
# Calculate the ratio of OBServed to EXPected
london$Flu_SMR <- london$Flu_OBS / london$Flu_EXP
# Map the SMR
sp::spplot(london, "Flu_SMR")
# SMRs above 1 represent high rates of disease - but how high does an SMR need to be before it can be considered statistically significant?
# Given a number of cases and a population, it's possible to work out confidence intervals, at some level, for the estimate of the ratio of cases per population using the properties of the binomial distribution
# The epitools package has a function binom.exact() which you can use to compute confidence intervals for the flu data
# These can be scaled to be confidence intervals on the SMR by dividing by the overall rate
# The london data set and the sp package are loaded
# For the binomial statistics function (called at top of routine)
# library(epitools)
# Get CI from binomial distribution
flu_ci <- binom.exact(london$Flu_OBS, london$TOTAL_POP)
# Add borough names
flu_ci$NAME <- london$NAME
# Calculate London rate, then compute SMR
r <- sum(london$Flu_OBS) / sum(london$TOTAL_POP)
flu_ci$SMR <- flu_ci$proportion / r
# Subset the high SMR data
flu_high <- flu_ci[flu_ci$SMR > 1, ]
# Plot estimates with CIs
ggplot(flu_high, aes(x = NAME, y = proportion / r, ymin = lower / r, ymax = upper / r)) +
geom_pointrange() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Distributions and confidence intervals can be difficult things to present to non-statisticians
# An alternative is to present a probability that a value is over a threshold
# For example, public health teams might be interested in when an SMR has more than doubled, and as a statistician you can give a probability that this has happened
# Then the public health team might decide to go to some alert level when the probability of a doubling of SMR is over 0.95
# Again, the properties of the binomial distribution let you compute this for proportional data
# You can then map these exceedance probabilities for some threshold, and use a sensible color scheme to highlight probabilities close to 1
# The london data set has been loaded, and the expected flu case count, Flu_EXP has been computed
# Probability of a binomial exceeding a multiple
binom.exceed <- function(observed, population, expected, e){
1 - pbinom(e * expected, population, prob = observed / population)
}
# Compute P(rate > 2)
london$Flu_gt_2 <- binom.exceed(
observed = london$Flu_OBS,
population = london$TOTAL_POP,
expected = london$Flu_EXP,
e = 2)
# Use a 50-color palette that only starts changing at around 0.9
pal <- c(
rep("#B0D0B0", 40),
colorRampPalette(c("#B0D0B0", "orange"))(5),
colorRampPalette(c("orange", "red"))(5)
)
# Plot the P(rate > 2) map
sp::spplot(london, "Flu_gt_2", col.regions = pal, at = seq(0, 1, len = 50))
# A Poisson generalized linear model is a way of fitting count data to explanatory variables
# You get out parameter estimates and standard errors for your explanatory variables, and can get fitted values and residuals
# The glm() function fits Poisson GLMs. It works just like the lm() function, but you also specify a family argument
# The formula has the usual meaning - response on the left of the ~, and explanatory variables on the right
# To cope with count data coming from populations of different sizes, you specify an offset argument
# This adds a constant term for each row of the data in the model. The log of the population is used in the offset term
# The london health data set has been loaded with the sp package also ready.
# Run a Poisson generalized linear model of how the count of flu cases, Flu_OBS, depends on the Health Deprivation index value, HealthDeprivation
# The first argument is the formula (response variable on the left)
# The family argument can be the poisson function or the string "poisson"
# Fit a Poisson GLM.
model_flu <- glm(
Flu_OBS ~ HealthDeprivation,
offset = log(TOTAL_POP),
data = london,
family = "poisson")
# Is HealthDeprivation significant?
summary(model_flu)
##
## Call:
## glm(formula = Flu_OBS ~ HealthDeprivation, family = "poisson",
## data = london, offset = log(TOTAL_POP))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -9.5361 -4.5285 -0.0499 2.9043 8.2194
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.78190 0.02869 -306.043 <2e-16 ***
## HealthDeprivation 0.65689 0.06797 9.665 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 703.75 on 31 degrees of freedom
## Residual deviance: 605.03 on 30 degrees of freedom
## AIC: 762.37
##
## Number of Fisher Scoring iterations: 5
# Put residuals into the spatial data.
london$Flu_Resid <- residuals(model_flu)
# Map the residuals using spplot
sp::spplot(london, "Flu_Resid")
# A linear model should fit the data and leave uncorrelated residuals
# This applies to non-spatial models, where, for example, fitting a straight line through points on a curve would lead to serially-correlated residuals
# A model on spatial data should aim to have residuals that show no significant spatial correlation
# You can test the model fitted to the flu data using moran.mc() from the spdep package
# Monte Carlo Moran tests were previously discussed in the Spatial autocorrelation test exercise earlier in the chapter
# Compute the neighborhood structure.
borough_nb <- poly2nb(london)
# Test spatial correlation of the residuals.
moran.mc(london$Flu_Resid, listw = nb2listw(borough_nb), nsim = 999)
##
## Monte-Carlo simulation of Moran I
##
## data: london$Flu_Resid
## weights: nb2listw(borough_nb)
## number of simulations + 1: 1000
##
## statistic = 0.15059, observed rank = 925, p-value = 0.075
## alternative hypothesis: greater
# Bayesian statistical models return samples of the parameters of interest (the "posterior" distribution) based on some "prior" distribution which is then updated by the data
# The Bayesian modelling process returns a number of samples from which you can compute the mean, an exceedance probability, or any other quantity you might compute from a distribution
# Before you fit a model with spatial correlation, you'll first fit the same model as before, but using Bayesian inference
# The london data set has been loaded
# The R2BayesX package provides an interface to the BayesX code.
# The syntax for bayesx() is similar, but the offset has to be specified explicitly from the data frame, the family name is in quotes, and the spatial data frame needs to be turned into a plain data frame
# Run the model fitting and inspect with summary()
# Plot the samples from the Bayesian model
# On the left is the "trace" of samples in sequential order, and on the right is the parameter density
# For this model there is an intercept and a slope for the Health Deprivation score
# The parameter density should correspond with the parameter summary
# Use R2BayesX (called at top of routine)
# library(R2BayesX)
# Fit a GLM
model_flu <- glm(Flu_OBS ~ HealthDeprivation, offset = log(TOTAL_POP),
data = london, family = poisson)
# Summarize it
summary(model_flu)
##
## Call:
## glm(formula = Flu_OBS ~ HealthDeprivation, family = poisson,
## data = london, offset = log(TOTAL_POP))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -9.5361 -4.5285 -0.0499 2.9043 8.2194
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.78190 0.02869 -306.043 <2e-16 ***
## HealthDeprivation 0.65689 0.06797 9.665 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 703.75 on 31 degrees of freedom
## Residual deviance: 605.03 on 30 degrees of freedom
## AIC: 762.37
##
## Number of Fisher Scoring iterations: 5
# Calculate coeff confidence intervals
confint(model_flu)
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) -8.838677 -8.7261843
## HealthDeprivation 0.524437 0.7908841
# Fit a Bayesian GLM
bayes_flu <- bayesx(Flu_OBS ~ HealthDeprivation, offset = log(london$TOTAL_POP),
family = "poisson", data = as.data.frame(london),
control = bayesx.control(seed = 17610407))
# Summarize it
summary(bayes_flu)
## Call:
## bayesx(formula = Flu_OBS ~ HealthDeprivation, data = as.data.frame(london),
## offset = log(london$TOTAL_POP), control = bayesx.control(seed = 17610407),
## family = "poisson")
##
## Fixed effects estimation results:
##
## Parametric coefficients:
## Mean Sd 2.5% 50% 97.5%
## (Intercept) -8.7831 0.0278 -8.8371 -8.7841 -8.7263
## HealthDeprivation 0.6592 0.0659 0.5345 0.6587 0.7900
##
## N = 32 burnin = 2000 method = MCMC family = poisson
## iterations = 12000 step = 10
# Look at the samples from the Bayesian model
plot(samples(bayes_flu))
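# The posterior samples also let you compute exceedance probabilities directly
# (an added sketch; the coercion and column name below are assumptions - inspect
# colnames() on your samples object first)
# sm <- as.data.frame(samples(bayes_flu))
# mean(sm[, "HealthDeprivation"] > 0.5)  # posterior P(slope > 0.5), assumed column name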
# You've fitted a non-spatial GLM with BayesX
# You can include a spatially correlated term based on the adjacency structure by adding a term to the formula specifying a spatially correlated model
# Use poly2nb() to compute the neighborhood structure of london to an nb object
# R2BayesX uses its own objects for the adjacency. Convert the nb object to a gra object
# The sx function specifies additional terms to bayesx. Create a term using the "spatial" basis and the gra object for the boroughs to define the map
# Print a summary of the model object. You should see a table of coefficients for the parametric part of the model as in the previous exercise, and then a table of "Smooth terms variance" with one row for the spatial term
# Note that BayesX can fit many different forms in its sx terms; most of these, like the spatial model here, cannot be simply expressed with a parameter or two
# Instead this table shows the variance of the random effects - for further explanation consult a text on random effects modelling
# Compute adjacency objects
borough_nb <- poly2nb(london)
borough_gra <- nb2gra(borough_nb)
# Fit spatial model
flu_spatial <- bayesx(
Flu_OBS ~ HealthDeprivation + sx(i, bs = "spatial", map = borough_gra),
offset = log(london$TOTAL_POP),
family = "poisson", data = data.frame(london),
control = bayesx.control(seed = 17610407)
)
## Note: created new output directory 'C:/Users/Dave/AppData/Local/Temp/Rtmpa43JUL/bayesx1'!
# Summarize the model
summary(flu_spatial)
## Call:
## bayesx(formula = Flu_OBS ~ HealthDeprivation + sx(i, bs = "spatial",
## map = borough_gra), data = data.frame(london), offset = log(london$TOTAL_POP),
## control = bayesx.control(seed = 17610407), family = "poisson")
##
## Fixed effects estimation results:
##
## Parametric coefficients:
## Mean Sd 2.5% 50% 97.5%
## (Intercept) -9.2311 0.1246 -9.4826 -9.2298 -9.0148
## HealthDeprivation 0.7683 0.5844 -0.3749 0.7775 1.7922
##
## Smooth terms variances:
## Mean Sd 2.5% 50% 97.5% Min Max
## sx(i):mrf 4.6381 1.6822 2.2851 4.3510 8.8104 1.6557 16.266
##
## N = 32 burnin = 2000 method = MCMC family = poisson
## iterations = 12000 step = 10
# As with glm, you can get the fitted values and residuals from your model using the fitted() and residuals() functions
# bayesx models are a bit more complex, since you have the linear predictor plus terms from sx elements, such as the spatially correlated term
# The summary function will show you information for the linear model terms and the smoothing terms in two separate tables
# The spatial term is called "sx(i):mrf" - standing for "Markov Random Field"
# Bayesian analysis returns samples from a distribution for the sx() term at each of the London boroughs
# The fitted function from bayesx models returns summary statistics for each borough
# You'll just look at the mean of that distribution for now
# The model from the BayesX output is available as flu_spatial.
# Summarise the model
summary(flu_spatial)
## Call:
## bayesx(formula = Flu_OBS ~ HealthDeprivation + sx(i, bs = "spatial",
## map = borough_gra), data = data.frame(london), offset = log(london$TOTAL_POP),
## control = bayesx.control(seed = 17610407), family = "poisson")
##
## Fixed effects estimation results:
##
## Parametric coefficients:
## Mean Sd 2.5% 50% 97.5%
## (Intercept) -9.2311 0.1246 -9.4826 -9.2298 -9.0148
## HealthDeprivation 0.7683 0.5844 -0.3749 0.7775 1.7922
##
## Smooth terms variances:
## Mean Sd 2.5% 50% 97.5% Min Max
## sx(i):mrf 4.6381 1.6822 2.2851 4.3510 8.8104 1.6557 16.266
##
## N = 32 burnin = 2000 method = MCMC family = poisson
## iterations = 12000 step = 10
# Map the fitted spatial term only
london$spatial <- fitted(flu_spatial, term = "sx(i):mrf")[, "Mean"]
sp::spplot(london, zcol = "spatial")
# Map the residuals
london$spatial_resid <- residuals(flu_spatial)[, "mu"]
sp::spplot(london, zcol = "spatial_resid")
# Test residuals for spatial correlation
moran.mc(london$spatial_resid, nb2listw(borough_nb), 999)
##
## Monte-Carlo simulation of Moran I
##
## data: london$spatial_resid
## weights: nb2listw(borough_nb)
## number of simulations + 1: 1000
##
## statistic = -0.26922, observed rank = 16, p-value = 0.984
## alternative hypothesis: greater
Chapter 4 - Geostatistics
Geostatistical data:
Variogram:
Kriging predictions:
Automatic kriging:
Wrap up:
Example code includes:
# Your job is to study the acidity (pH) of some Canadian survey data. The survey measurements are loaded into a spatial data object called ca_geo
# ca_geo has been pre-defined
ca_geo <- readRDS("./RInputFiles/ca_geo.RDS")
summary(ca_geo)
## Object of class SpatialPointsDataFrame
## Coordinates:
## min max
## x 542608.7 714269.2
## y 5541290.4 5652558.9
## Is projected: TRUE
## proj4string :
## [+init=epsg:32609 +proj=utm +zone=9 +datum=WGS84 +units=m +no_defs
## +ellps=WGS84 +towgs84=0,0,0]
## Number of points: 1140
## Data attributes:
## ID Elev pH Zn
## 102I881003: 1 Min. : 5.0 Min. :3.900 Min. : 1.00
## 102I881004: 1 1st Qu.: 20.0 1st Qu.:6.100 1st Qu.: 40.00
## 102I881005: 1 Median :110.0 Median :6.600 Median : 62.00
## 102I881006: 1 Mean :183.6 Mean :6.531 Mean : 72.34
## 102I881007: 1 3rd Qu.:310.0 3rd Qu.:7.000 3rd Qu.: 88.00
## 102I881008: 1 Max. :914.0 Max. :8.700 Max. :510.00
## (Other) :1134 NA's :9 NA's :33
## Cu Pb Ni Co
## Min. : 1.00 Min. : 1.000 Min. : 1.00 Min. : 1.00
## 1st Qu.: 21.00 1st Qu.: 1.000 1st Qu.: 7.00 1st Qu.:11.00
## Median : 37.00 Median : 1.000 Median : 20.00 Median :19.00
## Mean : 57.45 Mean : 2.975 Mean : 27.85 Mean :20.16
## 3rd Qu.: 76.00 3rd Qu.: 3.000 3rd Qu.: 37.00 3rd Qu.:27.00
## Max. :2950.00 Max. :195.000 Max. :340.00 Max. :77.00
##
## Ag Mn Fe Mo
## Min. :0.1000 Min. : 2.0 Min. : 0.010 Min. : 1.000
## 1st Qu.:0.1000 1st Qu.: 490.0 1st Qu.: 4.000 1st Qu.: 1.000
## Median :0.1000 Median : 820.0 Median : 5.100 Median : 1.000
## Mean :0.1146 Mean : 959.5 Mean : 5.168 Mean : 1.654
## 3rd Qu.:0.1000 3rd Qu.:1200.0 3rd Qu.: 6.200 3rd Qu.: 2.000
## Max. :7.9000 Max. :9700.0 Max. :24.000 Max. :46.000
##
## U W Sn Hg
## Min. :-1.00 Min. :-1.00 Min. : 1.000 Min. : 5
## 1st Qu.: 0.70 1st Qu.: 1.00 1st Qu.: 1.000 1st Qu.: 60
## Median : 1.10 Median : 1.00 Median : 1.000 Median : 80
## Mean : 1.36 Mean : 1.14 Mean : 1.123 Mean : 232
## 3rd Qu.: 1.70 3rd Qu.: 1.00 3rd Qu.: 1.000 3rd Qu.: 120
## Max. : 9.10 Max. :32.00 Max. :140.000 Max. :20000
##
## As Sb Ba Cd
## Min. : 1.00 Min. : 0.1000 Min. : 5 Min. : 0.100
## 1st Qu.: 5.00 1st Qu.: 0.1000 1st Qu.: 200 1st Qu.: 0.100
## Median : 6.00 Median : 0.1000 Median : 300 Median : 0.100
## Mean : 10.95 Mean : 0.2411 Mean : 301 Mean : 0.165
## 3rd Qu.: 10.00 3rd Qu.: 0.1000 3rd Qu.: 390 3rd Qu.: 0.100
## Max. :360.00 Max. :15.0000 Max. :1800 Max. :14.800
##
## V Bi Cr LoI
## Min. : 2.0 Min. :0.1000 Min. : 5.0 Min. :-1.00
## 1st Qu.:215.0 1st Qu.:0.1000 1st Qu.: 52.0 1st Qu.: 6.20
## Median :295.0 Median :0.1000 Median : 88.0 Median : 9.20
## Mean :318.3 Mean :0.1213 Mean :114.1 Mean :11.45
## 3rd Qu.:410.0 3rd Qu.:0.1000 3rd Qu.:148.0 3rd Qu.:14.00
## Max. :960.0 Max. :4.2000 Max. :860.0 Max. :82.80
##
## F Au
## Min. : 20.0 Min. : 1.00
## 1st Qu.:120.0 1st Qu.: 1.00
## Median :150.0 Median : 2.00
## Mean :164.2 Mean : 24.55
## 3rd Qu.:200.0 3rd Qu.: 5.00
## Max. :620.0 Max. :5800.00
##
str(ca_geo, 1)
## Formal class 'SpatialPointsDataFrame' [package "sp"] with 5 slots
# See what measurements are at each location
names(ca_geo)
## [1] "ID" "Elev" "pH" "Zn" "Cu" "Pb" "Ni" "Co" "Ag" "Mn"
## [11] "Fe" "Mo" "U" "W" "Sn" "Hg" "As" "Sb" "Ba" "Cd"
## [21] "V" "Bi" "Cr" "LoI" "F" "Au"
# Get a summary of the acidity (pH) values
summary(ca_geo$pH)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 3.900 6.100 6.600 6.531 7.000 8.700 33
# Look at the distribution
hist(ca_geo$pH)
# Make a vector that is TRUE for the missing data
miss <- is.na(ca_geo$pH)
table(miss)
## miss
## FALSE TRUE
## 1107 33
# Plot a map of acidity
spplot(ca_geo[!miss, ], "pH")
# The acidity data shows pH broadly increasing from north-east to south-west. Fitting a linear model with the coordinates as covariates will interpolate a flat plane through the values
# ca_geo has been pre-defined
str(ca_geo, 1)
## Formal class 'SpatialPointsDataFrame' [package "sp"] with 5 slots
# Are they called lat-long, up-down, or what?
coordnames(ca_geo)
## [1] "x" "y"
# Complete the formula
m_trend <- lm(pH ~ x + y, as.data.frame(ca_geo))
# Check the coefficients
summary(m_trend)
##
## Call:
## lm(formula = pH ~ x + y, data = as.data.frame(ca_geo))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.83561 -0.32091 -0.00761 0.33188 2.06249
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.358e+01 3.002e+00 27.84 <2e-16 ***
## x -5.691e-06 3.483e-07 -16.34 <2e-16 ***
## y -1.313e-05 5.319e-07 -24.68 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5299 on 1104 degrees of freedom
## (33 observations deleted due to missingness)
## Multiple R-squared: 0.4237, Adjusted R-squared: 0.4227
## F-statistic: 405.9 on 2 and 1104 DF, p-value: < 2.2e-16
# Your next task is to compute the pH at the locations that have missing data in the source. You can use the predict() function on the fitted model from the previous exercise for this
# ca_geo, miss, m_trend have been pre-defined
# ls.str()
# Make a vector that is TRUE for the missing data
miss <- is.na(ca_geo$pH)
# Create a data frame of missing data
ca_geo_miss <- as.data.frame(ca_geo)[miss, ]
# Predict pH for the missing data
predictions <- predict(m_trend, newdata = ca_geo_miss, se.fit = TRUE)
# Compute the exceedence probability
pAlkaline <- 1 - pnorm(7, mean = predictions$fit, sd = predictions$se.fit)
hist(pAlkaline)
# You can use the gstat package to plot variogram clouds and the variograms from data. Recall:
# The variogram cloud shows the differences of the measurements against distance for all pairs of data points
# The binned variogram divides the cloud into distance bins and computes the average difference within each bin
# The y-range of the binned variogram is always much smaller than the variogram cloud because the cloud includes the full range of values that go into computing the mean for the binned variogram
# The acidity survey data, ca_geo and the missing value index, miss have been pre-defined
# The gstat variogram() function uses the cloud argument to plot a variogram cloud - the default cloud parameter is FALSE
# ca_geo, miss have been pre-defined
# ls.str()
# Make a cloud from the non-missing data up to 10km
plot(gstat::variogram(pH ~ 1, ca_geo[!miss, ], cloud = TRUE, cutoff = 10000))
# Make a variogram of the non-missing data
plot(gstat::variogram(pH ~ 1, ca_geo[!miss, ]))
# You might imagine that if soil at a particular point is alkaline, then soil one metre away is likely to be alkaline too
# But can you say the same thing about soil one kilometre away, or ten kilometres, or one hundred kilometres?
# The shape of the previous variogram tells you there is a large-scale trend in the data
# You can fit a variogram considering this trend with gstat
# This variogram should flatten out, indicating there is no more spatial correlation after a certain distance with the trend taken into account
# ca_geo, miss have been pre-defined
# ls.str()
# See what coordinates are called
coordnames(ca_geo)
## [1] "x" "y"
# The pH depends on the coordinates
ph_vgm <- gstat::variogram(pH ~ x + y, ca_geo[!miss, ])
plot(ph_vgm)
# Next you'll fit a model to your variogram
# The gstat function fit.variogram() does this
# You need to give it some initial values as a starting point for the optimization algorithm to fit a better model
# The sill is the upper limit of the model
# That is, the long-range largest value, ignoring any outliers
# A variogram has been plotted for you, and ph_vgm has been pre-defined
# Estimate some parameters by eyeballing the plot
# The nugget is the value of the semivariance at zero distance.
# The partial sill, psill is the difference between the sill and the nugget.
# Set the range to the distance at which the variogram has got about half way between the nugget and the sill
# Fit a variogram model by calling fit.variogram()
# The second argument should take the parameters you estimated, wrapped in a call to vgm()
# ca_geo, miss, ph_vgm have been pre-defined
# ls.str()
# Eyeball the variogram and estimate the initial parameters
nugget <- 0.16
psill <- 0.15
range <- 10000
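# As a cross-check on the eyeballed values, a minimal sketch of rough starting
# values read off the empirical variogram programmatically (these heuristics
# are an assumption added here, not part of the course)
nugget_guess <- ph_vgm$gamma[1]               # semivariance at the shortest lag
sill_guess <- mean(tail(ph_vgm$gamma, 3))     # long-range plateau
psill_guess <- sill_guess - nugget_guess
halfway <- nugget_guess + psill_guess / 2
range_guess <- ph_vgm$dist[which.min(abs(ph_vgm$gamma - halfway))]
c(nugget = nugget_guess, psill = psill_guess, range = range_guess)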
# Fit the variogram
v_model <- gstat::fit.variogram(
ph_vgm,
model = gstat::vgm(
model = "Ste",
nugget = nugget,
psill = psill,
range = range,
kappa = 0.5
)
)
# Show the fitted variogram on top of the binned variogram
plot(ph_vgm, model = v_model)
print(v_model)
## model psill range kappa
## 1 Nug 0.1545116 0.00 0.0
## 2 Ste 0.1442007 14379.29 0.5
# The final part of geostatistical estimation is kriging itself
# This is the application of the variogram along with the sample data points to produce estimates and uncertainties at new locations
# The computation of estimates and uncertainties, together with the assumption of a normal (Gaussian) response means you can compute any function of the estimates - for example the probability of a new location having alkaline soil
# The acidity survey data, ca_geo, the missing value index, miss, and the variogram model, v_model, have been pre-defined
# ca_geo, miss, v_model have been pre-defined
# ls.str()
# Set the trend formula and the new data
km <- gstat::krige(pH ~ x + y, ca_geo[!miss, ], newdata = ca_geo[miss, ], model = v_model)
## [using universal kriging]
names(km)
## [1] "var1.pred" "var1.var"
# Plot the predicted values
spplot(km, "var1.pred")
# Compute the probability of alkaline samples, and map
km$pAlkaline <- 1 - pnorm(7, mean = km$var1.pred, sd = sqrt(km$var1.var))
spplot(km, "pAlkaline")
# You have been asked to produce an alkaline probability map over the study area
# To do this, you are going to do some kriging via the krige() function
# This requires a SpatialPixels object which will take a bit of data manipulation to create
# You start by defining a grid, creating points on that grid, cropping to the study region, and then finally converting to SpatialPixels
# On the way, you'll meet some new functions
# GridTopology() defines a rectangular grid. It takes three vectors of length two as inputs
# The first specifies the position of the bottom left corner of the grid
# The second specifies the width and height of each rectangle in the grid, and the third specifies the number of rectangles in each direction
# To ensure that the grid and the study area have the same coordinates, some housekeeping is involved
# SpatialPoints() creates points from the grid with a specified coordinate reference system (CRS), or projection (different packages use different terminology for the same concept)
# The CRS is created by wrapping the study area in projection(), then in CRS()
# For the purpose of this exercise, you don't need to worry about exactly what these functions do, only that this data manipulation is necessary to align the grid and the study area
# Now that you have that alignment, crop(), as the name suggests, crops the grid to the study area
# Finally, SpatialPixels() converts the grid points cropped with raster::crop() to the equivalent sp object
# The acidity survey data, ca_geo, the missing value index, miss, the variogram, vgm, and the variogram model, v_model, have been pre-defined
# A rough outline of the study area is in an object called geo_bounds
# ca_geo, geo_bounds have been pre-defined
# ls.str()
# Plot the polygon and points
geo_bounds <- readRDS("./RInputFiles/ca_geo_bounds.RDS")
plot(geo_bounds)
points(ca_geo)
# Find the corners of the boundary
bbox(geo_bounds)
## min max
## x 537853.4 719269.2
## y 5536290.4 5657400.9
# Define a 2.5km square grid over the polygon extent. The first parameter is
# the bottom left corner.
grid <- GridTopology(c(537853, 5536290), c(2500, 2500), c(72, 48))
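# Where the hard-coded 72 x 48 cell counts come from (a sketch added here, not
# part of the course): the bounding-box extent divided by the 2.5 km cell size,
# rounded down
floor((bbox(geo_bounds)[, "max"] - bbox(geo_bounds)[, "min"]) / 2500)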
# Create points with the same coordinate system as the boundary
gridpoints <- SpatialPoints(grid, proj4string = CRS(raster::projection(geo_bounds)))
plot(gridpoints)
# Crop out the points outside the boundary
cropped_gridpoints <- raster::crop(gridpoints, geo_bounds)
plot(cropped_gridpoints)
# Convert to SpatialPixels
spgrid <- SpatialPixels(cropped_gridpoints)
coordnames(spgrid) <- c("x", "y")
plot(spgrid)
# The spatial pixel grid of the region, spgrid, and the variogram model of pH, v_model have been pre-defined
# spgrid, v_model have been pre-defined
# ls.str()
# Do kriging predictions over the grid
ph_grid <- gstat::krige(pH ~ x + y, ca_geo[!miss, ], newdata = spgrid, model = v_model)
## [using universal kriging]
# Calc the probability of pH exceeding 7
ph_grid$pAlkaline <- 1 - pnorm(7, mean = ph_grid$var1.pred, sd = sqrt(ph_grid$var1.var))
# Map the probability of alkaline samples
spplot(ph_grid, zcol = "pAlkaline")
# The autoKrige() function in the automap package computes binned variograms, fits models, does model selection, and performs kriging by making multiple calls to the gstat functions you used previously
# It can be a great time-saver but you should always check the results carefully.
# autoKrige() can try several variogram model types
# In the example, you'll use a Matern variogram model, which is commonly used in soil and forestry analyses
# You can see a complete list of available models by calling vgm() with no arguments
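# List the available variogram model types, as described above
gstat::vgm()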
# The acidity survey data, ca_geo, and the missing value index, miss, have been pre-defined
# ca_geo, miss are pre-defined
# ls.str()
# Kriging with linear trend, predicting over the missing points
ph_auto <- automap::autoKrige(
pH ~ x + y,
input_data = ca_geo[!miss, ],
new_data = ca_geo[miss, ],
model = "Mat"
)
## [using universal kriging]
# Plot the variogram, predictions, and standard error
plot(ph_auto)
# You can also use autoKrige() over the spgrid grid from the earlier exercise
# This brings together all the concepts that you've learned in the chapter
# That is, kriging is great for predicting missing data, plotting things on a grid is much clearer than plotting individual points, and automatic kriging is less hassle than manual kriging
# The acidity survey data, ca_geo, the missing value index, miss, the spatial pixel grid of the region, spgrid, the manual kriging grid model, ph_grid, and the variogram model of pH, v_model have been pre-defined
# ca_geo, miss, spgrid, ph_grid, v_model are pre-defined
# ls.str()
# Auto-run the kriging
ph_auto_grid <- automap::autoKrige(pH ~ x + y, input_data = ca_geo[!miss, ], new_data = spgrid)
## [using universal kriging]
# Remember predictions from manual kriging
plot(ph_grid)
# Plot predictions and variogram fit
plot(ph_auto_grid)
# Compare the variogram model to the earlier one
v_model
## model psill range kappa
## 1 Nug 0.1545116 0.00 0.0
## 2 Ste 0.1442007 14379.29 0.5
ph_auto_grid$var_model
## model psill range kappa
## 1 Nug 0.1846444 0.00 0
## 2 Ste 0.1092281 13085.13 2
Chapter 1 - Vector and Raster Spatial Data in R
Reading vector and raster data into R:
Getting to know your vector data:
Getting to know your raster data:
Example code includes:
# Load the sf package
library(sf)
## Linking to GEOS 3.6.1, GDAL 2.2.0, proj.4 4.9.3
# Read in the trees shapefile
trees <- st_read("./RInputFiles/ZIP Files/trees/trees.shp")
## Reading layer `trees' from data source `C:\Users\Dave\Documents\Personal\Learning\Coursera\RDirectory\RHomework\DataCamp\RInputFiles\ZIP Files\trees\trees.shp' using driver `ESRI Shapefile'
## Simple feature collection with 65217 features and 7 fields
## geometry type: POINT
## dimension: XY
## bbox: xmin: -74.2546 ymin: 40.49894 xmax: -73.70078 ymax: 40.91165
## epsg (SRID): 4326
## proj4string: +proj=longlat +ellps=WGS84 +no_defs
# Read in the neighborhood shapefile
neighborhoods <- st_read("./RInputFiles/ZIP Files/neighborhoods/neighborhoods.shp")
## Reading layer `neighborhoods' from data source `C:\Users\Dave\Documents\Personal\Learning\Coursera\RDirectory\RHomework\DataCamp\RInputFiles\ZIP Files\neighborhoods\neighborhoods.shp' using driver `ESRI Shapefile'
## Simple feature collection with 195 features and 5 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -74.25559 ymin: 40.49612 xmax: -73.70001 ymax: 40.91553
## epsg (SRID): 4326
## proj4string: +proj=longlat +ellps=WGS84 +no_defs
# Read in the parks shapefile
parks <- st_read("./RInputFiles/ZIP Files/parks/parks.shp")
## Reading layer `parks' from data source `C:\Users\Dave\Documents\Personal\Learning\Coursera\RDirectory\RHomework\DataCamp\RInputFiles\ZIP Files\parks\parks.shp' using driver `ESRI Shapefile'
## Simple feature collection with 2008 features and 14 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -74.25609 ymin: 40.49449 xmax: -73.70905 ymax: 40.91133
## epsg (SRID): 4326
## proj4string: +proj=longlat +datum=WGS84 +no_defs
# View the first few trees
head(trees)
## Simple feature collection with 6 features and 7 fields
## geometry type: POINT
## dimension: XY
## bbox: xmin: -74.13116 ymin: 40.62351 xmax: -73.80057 ymax: 40.77393
## epsg (SRID): 4326
## proj4string: +proj=longlat +ellps=WGS84 +no_defs
## tree_id nta longitude latitude stump_diam species boroname
## 1 558423 QN76 -73.80057 40.67035 0 honeylocust Queens
## 2 286191 MN32 -73.95422 40.77393 0 Callery pear Manhattan
## 3 257044 QN70 -73.92309 40.76196 0 Chinese elm Queens
## 4 603262 BK09 -73.99866 40.69312 0 cherry Brooklyn
## 5 41769 SI22 -74.11773 40.63166 0 cherry Staten Island
## 6 24024 SI07 -74.13116 40.62351 0 red maple Staten Island
## geometry
## 1 POINT (-73.80057 40.67035)
## 2 POINT (-73.95422 40.77393)
## 3 POINT (-73.92309 40.76196)
## 4 POINT (-73.99866 40.69312)
## 5 POINT (-74.11773 40.63166)
## 6 POINT (-74.13116 40.62351)
# The term "raster" refers to gridded data that can include satellite imagery, aerial photographs (like orthophotos) and other types
# In R, raster data can be handled using the raster package created by Robert J. Hijmans
# When working with raster data, one of the most important things to keep in mind is that the raw data can be what is known as "single-band" or "multi-band" and these are handled a little differently in R
# Single-band rasters are the simplest, these have a single layer of raster values -- a classic example would be an elevation raster where each cell value represents the elevation at that location
# Multi-band rasters will have more than one layer. An example is a color aerial photo in which there would be one band each representing red, green or blue light.
# Load the raster package
library(raster)
## Loading required package: sp
##
## Attaching package: 'raster'
## The following objects are masked from 'package:spatstat':
##
## area, rotate, shift
## The following object is masked from 'package:nlme':
##
## getData
## The following object is masked from 'package:dplyr':
##
## select
# Read in the tree canopy single-band raster
canopy <- raster("./RInputFiles/ZIP Files/canopy/canopy.tif")
# Read in the manhattan Landsat image multi-band raster
manhattan <- brick("./RInputFiles/ZIP Files/manhattan/manhattan.tif")
# Get the class for the new objects
class(canopy)
## [1] "RasterLayer"
## attr(,"package")
## [1] "raster"
class(manhattan)
## [1] "RasterBrick"
## attr(,"package")
## [1] "raster"
# Identify how many layers each object has
nlayers(canopy)
## [1] 1
nlayers(manhattan)
## [1] 3
# As mentioned in the video, spatial objects in sf are just data frames with some special properties
# This means that packages like dplyr can be used to manipulate sf objects
# In this exercise, you will use the dplyr functions select() to select or drop variables, filter() to filter the data and mutate() to add or alter columns
# Load the dplyr and sf packages
# library(dplyr)
# library(sf)
# Read in the trees shapefile (already read in above)
# trees <- st_read("trees.shp")
# Use filter() to limit to honey locust trees
honeylocust <- trees %>% filter(species == "honeylocust")
# Count the number of rows
nrow(honeylocust)
## [1] 6418
# Limit to tree_id and boroname variables
honeylocust_lim <- honeylocust %>% dplyr::select(tree_id, boroname)
# Use head() to look at the first few records
head(honeylocust_lim)
## Simple feature collection with 6 features and 2 fields
## geometry type: POINT
## dimension: XY
## bbox: xmin: -73.97524 ymin: 40.67035 xmax: -73.80057 ymax: 40.83136
## epsg (SRID): 4326
## proj4string: +proj=longlat +ellps=WGS84 +no_defs
## tree_id boroname geometry
## 1 558423 Queens POINT (-73.80057 40.67035)
## 2 548625 Brooklyn POINT (-73.97524 40.68575)
## 3 549439 Brooklyn POINT (-73.94137 40.67557)
## 4 181757 Brooklyn POINT (-73.89377 40.67169)
## 5 379387 Queens POINT (-73.8221 40.69365)
## 6 383562 Bronx POINT (-73.90041 40.83136)
# In this exercise, you will convert the data frame to what's called a tibble with tibble::as_tibble() (Note that dplyr::tbl_df() is now deprecated)
# tibble is loaded in your workspace
# Create a standard, non-spatial data frame with one column
df <- data.frame(a = 1:3)
# Add a list column to your data frame
df$b <- list(1:4, 1:5, 1:10)
# Look at your data frame with head
head(df)
## a b
## 1 1 1, 2, 3, 4
## 2 2 1, 2, 3, 4, 5
## 3 3 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
# Convert your data frame to a tibble and print on console
as_tibble(df)
## # A tibble: 3 x 2
## a b
## <int> <list>
## 1 1 <int [4]>
## 2 2 <int [5]>
## 3 3 <int [10]>
# Pull out the third observation from both columns individually
df$a[3]
## [1] 3
df$b[3]
## [[1]]
## [1] 1 2 3 4 5 6 7 8 9 10
# There are several functions in sf that allow you to access geometric information like area from your vector features
# For example, the functions st_area() and st_length() return the area and length of your features, respectively
# Note that the result of functions like st_area() and st_length() will not be a traditional vector
# Instead the result has a class of units which means the vector result is accompanied by metadata describing the object's units
# To compare the result against a plain number, you either need to remove the units with unclass()
# or convert the number to the units class, for example with units(val) <- units(result)
# sf and dplyr are loaded in your workspace
# Read in the parks shapefile (already read in above)
# parks <- st_read("parks.shp")
# Compute the areas of the parks
areas <- st_area(parks)
# Create a quick histogram of the areas using hist
hist(areas, xlim = c(0, 200000), breaks = 1000)
# Filter to parks greater than 30000 (square meters)
big_parks <- parks %>% filter(unclass(areas) > 30000)
# Plot just the geometry of big_parks
plot(st_geometry(big_parks))
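# An alternative to unclass(): keep the units and compare against a value that
# itself carries units (a sketch, assuming the units package that sf depends on
# is installed)
library(units)
big_parks_units <- parks %>% filter(areas > set_units(30000, m^2))
# Should select the same parks as the unclass() approach above
nrow(big_parks_units)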
# Plot the parks object using all defaults
plot(parks)
## Warning: plotting the first 9 out of 14 attributes; use max.plot = 14 to
## plot all
# Plot just the acres attribute of the parks data
plot(parks["acres"])
# Create a new object of just the parks geometry
parks_geo <- st_geometry(parks)
# Plot the geometry of the parks data
plot(parks_geo)
# Load the raster package
# library(raster)
# Read in the rasters (done previously)
# canopy <- raster("canopy.tif")
# manhattan <- brick("manhattan.tif")
# Get the extent of the canopy object
extent(canopy)
## class : Extent
## xmin : 1793685
## xmax : 1869585
## ymin : 2141805
## ymax : 2210805
# Get the CRS of the manhattan object
crs(manhattan)
## CRS arguments:
## +proj=utm +zone=18 +datum=WGS84 +units=m +no_defs +ellps=WGS84
## +towgs84=0,0,0
# Determine the number of grid cells in both raster objects
ncell(manhattan)
## [1] 619173
ncell(canopy)
## [1] 58190
# Raster data can be very big depending on the extent and resolution (grid size)
# In order to deal with this the raster() and brick() functions are designed to only read in the actual raster values as needed
# To show that this is true, you can use the inMemory() function on an object and it will return FALSE if the values are not in memory
# If you use the head() function, the raster package will read in only the values needed, not the full set of values
# The raster values will be read in by default if you perform spatial analysis operations that require it or you can read in the values from a raster manually with the function getValues()
graphics.off()
# Check if the data is in memory
inMemory(canopy)
## [1] FALSE
# Use head() to peek at the first few records
head(canopy)
## 1 2 3 4 5 6 7 8 9 10 11 12
## 1 0.00 19.35 47.88 17.17 54.27 70.93 81.18 84.23 88.86 87.17 82.27 81.65
## 2 0.00 10.65 58.61 28.77 51.19 53.65 85.29 88.81 89.00 84.59 79.00 87.18
## 3 0.00 0.00 17.26 28.72 49.04 43.84 76.13 83.78 88.30 76.47 84.44 69.35
## 4 0.00 0.96 23.81 64.48 38.24 36.16 79.26 87.02 83.80 70.21 34.33 16.14
## 5 0.00 6.97 38.18 81.83 48.02 52.84 71.18 87.21 76.72 72.90 7.12 47.38
## 6 14.89 31.42 34.17 51.29 85.26 70.05 74.99 83.52 83.98 75.74 41.93 84.72
## 7 65.06 37.87 30.91 23.92 35.21 53.85 85.32 85.59 85.63 76.34 77.72 81.97
## 8 68.95 43.38 37.51 22.02 27.54 72.25 84.80 86.20 86.14 84.44 82.56 67.32
## 9 58.23 33.00 43.03 12.07 19.64 76.00 76.35 76.53 83.94 85.48 83.76 41.86
## 10 46.31 53.63 23.67 10.73 48.16 60.86 63.47 69.98 61.79 55.78 34.47 49.32
## 13 14 15 16 17 18 19 20
## 1 77.95 80.72 64.98 80.05 66.04 34.45 28.03 54.67
## 2 73.62 84.57 72.14 83.08 72.75 13.29 33.37 42.57
## 3 66.02 83.02 72.60 77.81 60.89 49.96 27.29 41.00
## 4 72.36 82.51 72.86 78.84 70.87 30.37 47.14 55.99
## 5 75.38 87.18 74.90 81.76 60.04 24.07 45.37 52.69
## 6 85.36 84.52 65.53 68.28 59.58 52.38 31.21 33.98
## 7 65.10 63.87 48.16 30.90 39.24 29.74 5.78 14.39
## 8 6.58 51.03 21.22 40.49 29.76 22.16 6.76 46.05
## 9 9.47 23.15 50.56 44.56 19.28 32.92 52.94 43.42
## 10 8.55 21.59 55.36 54.16 45.76 47.89 59.27 47.48
# Use getValues() to read the values into a vector
vals <- getValues(canopy)
# Use hist() to create a histogram of the values
hist(vals)
# The raster package has added useful methods for plotting both single and multi-band rasters
# For single-band rasters or for a map of each layer in a multi-band raster you can simply use plot()
# If you have a multi-band raster with layers for red, green and blue light you can use the plotRGB() function to plot the raster layers together as a single image
# Plot the canopy raster (single raster)
plot(canopy)
# Plot the manhattan raster (as a single image for each layer)
plot(manhattan)
# Plot the manhattan raster as an image
plotRGB(manhattan)
# raster masks dplyr::select
detach("package:raster")
Chapter 2 - Preparing Layers for Spatial Analysis
Quick refresher on coordinate reference systems (CRS):
Manipulating vector layers with dplyr:
Converting sf objects into sp objects and coordinates:
Manipulating raster layers:
Example code includes:
library(sf)
library(raster)
##
## Attaching package: 'raster'
## The following object is masked from 'package:qdapTools':
##
## shift
## The following object is masked from 'package:qdapRegex':
##
## bind
## The following object is masked from 'package:magrittr':
##
## extract
## The following object is masked from 'package:colorspace':
##
## RGB
## The following objects are masked from 'package:spatstat':
##
## area, rotate, shift
## The following object is masked from 'package:nlme':
##
## getData
## The following object is masked from 'package:dplyr':
##
## select
# In order to perform any spatial analysis with more than one layer, your layers should share the same coordinate reference system (CRS) and the first step is determining what coordinate reference system your data has
# To do this you can make use of the sf function st_crs() and the raster function crs()
# When the geographic data you read in with sf already has a CRS defined both sf and raster will recognize and retain it
# When the CRS is not defined you will need to define it yourself using either the EPSG number or the proj4string
# Determine the CRS for the neighborhoods and trees vector objects
st_crs(neighborhoods)
## Coordinate Reference System:
## EPSG: 4326
## proj4string: "+proj=longlat +ellps=WGS84 +no_defs"
st_crs(trees)
## Coordinate Reference System:
## EPSG: 4326
## proj4string: "+proj=longlat +ellps=WGS84 +no_defs"
# Assign the CRS to trees
crs_1 <- "+proj=longlat +ellps=WGS84 +no_defs"
st_crs(trees) <- crs_1
# Determine the CRS for the canopy and manhattan rasters
crs(canopy)
## CRS arguments:
## +proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=23 +lon_0=-96 +x_0=0
## +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
crs(manhattan)
## CRS arguments:
## +proj=utm +zone=18 +datum=WGS84 +units=m +no_defs +ellps=WGS84
## +towgs84=0,0,0
# Assign the CRS to manhattan
crs_2 <- "+proj=utm +zone=18 +ellps=GRS80 +datum=NAD83 +units=m +no_defs"
crs(manhattan) <- crs_2
# In this exercise you will transform (sometimes this is called "project") the objects so they share a single CRS
# It is generally best to perform spatial analysis with layers that have a projected CRS (and some functions require this)
# To determine if your object has a projected CRS you can look at the first part of the result from st_crs() or crs() -- if it begins with +proj=longlat then your CRS is unprojected
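# A quick programmatic version of that check (a sketch, not part of the course)
grepl("+proj=longlat", st_crs(trees)$proj4string, fixed = TRUE)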
# Note that you will use method = "ngb" in your call to projectRaster() to prevent distortion in the manhattan image
# Get the CRS from the canopy object
the_crs <- crs(canopy, asText = TRUE)
# Project trees to match the CRS of canopy
trees_crs <- st_transform(trees, crs = the_crs)
# Project neighborhoods to match the CRS of canopy
neighborhoods_crs <- st_transform(neighborhoods, crs = the_crs)
# Project manhattan to match the CRS of canopy
manhattan_crs <- projectRaster(manhattan, crs = the_crs, method = "ngb")
# Look at the CRS to see if they match
st_crs(trees_crs)
## Coordinate Reference System:
## No EPSG code
## proj4string: "+proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=23 +lon_0=-96 +x_0=0 +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs"
st_crs(neighborhoods_crs)
## Coordinate Reference System:
## No EPSG code
## proj4string: "+proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=23 +lon_0=-96 +x_0=0 +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs"
crs(manhattan_crs)
## CRS arguments:
## +proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=23 +lon_0=-96 +x_0=0
## +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
# If the layers do not share a common CRS they may not align on a plot
# To illustrate, in this exercise, you will initially create a plot with the plot() function and try to add two layers that do not share the same CRS
# You will then transform one layer's CRS to match the other and you will plot this with both the plot() function and functions from the tmap package.
# Note that for this exercise we returned all the layers to their original CRS and did not retain the changes you made in the last exercise
# With the plot() function you can plot multiple layers on the same map by calling plot() multiple times
# You'll need to add the argument add = TRUE to all calls to plot() after the first one and you need to run the code for all layers at once rather than line-by-line
# Plot canopy and neighborhoods (run both lines together)
# Do you see the neighborhoods?
plot(canopy)
plot(neighborhoods$geometry, add = TRUE)
# See if canopy and neighborhoods share a CRS
st_crs(neighborhoods)
## Coordinate Reference System:
## EPSG: 4326
## proj4string: "+proj=longlat +ellps=WGS84 +no_defs"
crs(canopy)
## CRS arguments:
## +proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=23 +lon_0=-96 +x_0=0
## +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
# Save the CRS of the canopy layer
the_crs <- crs(canopy, asText = TRUE)
# Transform the neighborhoods CRS to match canopy
neighborhoods_crs <- st_transform(neighborhoods, crs=the_crs)
# Re-run plotting code (run both lines together)
# Do the neighborhoods show up now?
plot(canopy)
plot(neighborhoods_crs$geometry, add = TRUE)
# Simply run the tmap code
tmap::tm_shape(canopy) +
tmap::tm_rgb() +
tmap::tm_shape(neighborhoods_crs) +
tmap::tm_polygons(alpha = 0.5)
# One of the great innovations of sf over sp is the use of data frames for storing spatial objects
# This allows you to slice and dice your spatial data in the same way you do for non-spatial data
# This means you can, for example, apply dplyr verbs directly to your sf object
# One important difference between dplyr with and without spatial data is that the resulting data frames will include the geometry variable unless you explicitly drop it
# If you want to force the geometry to be dropped you would use the sf function st_set_geometry() and you would set the geometry to NULL
# The packages sf and dplyr, and the object trees are loaded in your workspace
# Create a data frame of counts by species
species_counts <- count(trees, species)
# Arrange in descending order
species_counts_desc <- arrange(species_counts, desc(n))
# Use head to see if the geometry column is in the data frame
head(species_counts_desc)
## Simple feature collection with 6 features and 2 fields
## geometry type: MULTIPOINT
## dimension: XY
## bbox: xmin: -74.25443 ymin: 40.49894 xmax: -73.70104 ymax: 40.91165
## epsg (SRID): 4326
## proj4string: +proj=longlat +ellps=WGS84 +no_defs
## # A tibble: 6 x 3
## species n geometry
## <fct> <int> <sf_geometry [degree]>
## 1 London planetree 8709 MULTIPOINT (-74.25408 40.50...
## 2 honeylocust 6418 MULTIPOINT (-74.25426 40.50...
## 3 Callery pear 5902 MULTIPOINT (-74.25443 40.50...
## 4 pin oak 5355 MULTIPOINT (-74.25329 40.50...
## 5 Norway maple 3373 MULTIPOINT (-74.25443 40.50...
## 6 littleleaf linden 3043 MULTIPOINT (-74.25032 40.51...
# Drop the geometry column
species_no_geometry <- st_set_geometry(species_counts_desc, NULL)
# Confirm the geometry column has been dropped
head(species_no_geometry)
## # A tibble: 6 x 2
## species n
## <fct> <int>
## 1 London planetree 8709
## 2 honeylocust 6418
## 3 Callery pear 5902
## 4 pin oak 5355
## 5 Norway maple 3373
## 6 littleleaf linden 3043
# In this exercise you will test joining spatial and non-spatial data. In particular, the trees data you have been working with has a full county name (the variable is called boroname) but does not have the county codes. The neighborhoods file has both a county name (the variable is called boro_name) and the county codes -- neighborhoods are nested within counties
# In this exercise, you will create a non-spatial data frame of county name and county code from the neighborhoods object
# Then you will join this data frame into the spatial trees object with inner_join()
# The packages sf and dplyr and the objects neighborhoods and trees are loaded in your workspace
# Limit to the fields boro_name, county_fip and boro_code
boro <- dplyr::select(neighborhoods, boro_name, county_fip, boro_code)
# Drop the geometry column
boro_no_geometry <- st_set_geometry(boro, NULL)
# Limit to distinct records
boro_distinct <- distinct(boro_no_geometry)
# Join the county detail into the trees object
trees_with_county <- inner_join(trees, boro_distinct, by = c("boroname" = "boro_name"))
# Confirm the new fields county_fip and boro_code exist
head(trees_with_county)
## Simple feature collection with 6 features and 9 fields
## geometry type: POINT
## dimension: XY
## bbox: xmin: -74.13116 ymin: 40.62351 xmax: -73.80057 ymax: 40.77393
## epsg (SRID): 4326
## proj4string: +proj=longlat +ellps=WGS84 +no_defs
## tree_id nta longitude latitude stump_diam species boroname
## 1 558423 QN76 -73.80057 40.67035 0 honeylocust Queens
## 2 286191 MN32 -73.95422 40.77393 0 Callery pear Manhattan
## 3 257044 QN70 -73.92309 40.76196 0 Chinese elm Queens
## 4 603262 BK09 -73.99866 40.69312 0 cherry Brooklyn
## 5 41769 SI22 -74.11773 40.63166 0 cherry Staten Island
## 6 24024 SI07 -74.13116 40.62351 0 red maple Staten Island
## county_fip boro_code geometry
## 1 081 4 POINT (-73.80057 40.67035)
## 2 061 1 POINT (-73.95422 40.77393)
## 3 081 4 POINT (-73.92309 40.76196)
## 4 047 3 POINT (-73.99866 40.69312)
## 5 085 5 POINT (-74.11773 40.63166)
## 6 085 5 POINT (-74.13116 40.62351)
# In sf you can use the st_simplify() function to reduce line and polygon complexity
# In this exercise you will measure the size of objects before and after st_simplify() in two ways
# You will compute the object size (the course uses pryr::object_size(); utils::object.size() is used below) and you will count the number of vertices -- the number of points required to delineate a line or polygon
# The packages sf and pryr are loaded in your workspace
# Plot the neighborhoods geometry
plot(st_geometry(neighborhoods), col = "grey")
# Measure the size of the neighborhoods object
utils::object.size(neighborhoods)
## 1890408 bytes
# Compute the number of vertices in the neighborhoods object
pts_neighborhoods <- st_cast(neighborhoods$geometry, "MULTIPOINT")
cnt_neighborhoods <- sapply(pts_neighborhoods, length)
sum(cnt_neighborhoods)
## [1] 210736
# Simplify the neighborhoods object
neighborhoods_simple <- st_simplify(neighborhoods,
preserveTopology = TRUE,
dTolerance = 0.0025)
## Warning in st_simplify.sfc(st_geometry(x), preserveTopology, dTolerance):
## st_simplify does not correctly simplify longitude/latitude data, dTolerance
## needs to be in decimal degrees
# Measure the size of the neighborhoods_simple object
utils::object.size(neighborhoods_simple)
## 248448 bytes
# Compute the number of vertices in the neighborhoods_simple object
pts_neighborhoods_simple <- st_cast(neighborhoods_simple$geometry, "MULTIPOINT")
cnt_neighborhoods_simple <- sapply(pts_neighborhoods_simple, length)
sum(cnt_neighborhoods_simple)
## [1] 4764
# Plot the neighborhoods_simple object geometry
plot(st_geometry(neighborhoods_simple), col = "grey")
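# The warning above notes that st_simplify() expects dTolerance in the units of
# the CRS, and these data are in degrees. A sketch of simplifying after
# projecting to a metric CRS instead (UTM zone 18N, EPSG:32618, chosen here as
# an assumption that suits New York City)
neighborhoods_utm <- st_transform(neighborhoods, crs = 32618)
neighborhoods_utm_simple <- st_simplify(neighborhoods_utm,
                                        preserveTopology = TRUE,
                                        dTolerance = 250)
plot(st_geometry(neighborhoods_utm_simple), col = "grey")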
# Read in the trees data (done previously)
# trees <- st_read("trees.shp")
# Convert to Spatial class
trees_sp <- as(trees, Class = "Spatial")
# Confirm conversion, should be "SpatialPointsDataFrame"
class(trees_sp)
## [1] "SpatialPointsDataFrame"
## attr(,"package")
## [1] "sp"
# Convert back to sf
trees_sf <- st_as_sf(trees_sp)
# Confirm conversion
class(trees_sf)
## [1] "sf" "data.frame"
# In order to convert a data frame of coordinates into an sf object you can make use of the st_as_sf() function you used in the previous exercise
# You can specify the coords argument with the names of the coordinate variables (with the X coordinate/longitude coordinate listed first) and, optionally, the crs argument if you know the CRS of your coordinates
# The CRS can be specified as a proj4 string or EPSG code
# If you want to convert your sf point objects to a data frame with coordinates, you can use the st_write() function with a
# hidden argument (these are arguments associated with an external utility called GDAL and so they're not in the R help) to force sf to include the coordinates in the output file
# The argument you need is layer_options = "GEOMETRY=AS_XY"
# Read in the CSV (done previously)
# trees <- read.csv("trees.csv")
# Convert the data frame to an sf object
trees_sf <- st_as_sf(trees, coords = c("longitude", "latitude"), crs = 4326)
# Plot the geometry of the points
plot(st_geometry(trees_sf))
# Write the file out with coordinates
st_write(trees_sf, "./RInputFiles/new_trees.csv", layer_options = "GEOMETRY=AS_XY", delete_dsn = TRUE)
## Deleting source `C:\Users\Dave\Documents\Personal\Learning\Coursera\RDirectory\RHomework\DataCamp\RInputFiles\new_trees.csv' using driver `CSV'
## Writing layer `new_trees' to data source `C:\Users\Dave\Documents\Personal\Learning\Coursera\RDirectory\RHomework\DataCamp\RInputFiles\new_trees.csv' using driver `CSV'
## options: GEOMETRY=AS_XY
## features: 65217
## fields: 7
## geometry type: Point
# Read in the file you just created and check coordinates
new_trees <- read.csv("./RInputFiles/new_trees.csv")
head(new_trees)
## X Y tree_id nta longitude latitude stump_diam
## 1 -73.80057 40.67035 558423 QN76 -73.80057 40.67035 0
## 2 -73.95422 40.77393 286191 MN32 -73.95422 40.77393 0
## 3 -73.92309 40.76196 257044 QN70 -73.92309 40.76196 0
## 4 -73.99866 40.69312 603262 BK09 -73.99866 40.69312 0
## 5 -74.11773 40.63166 41769 SI22 -74.11773 40.63166 0
## 6 -74.13116 40.62351 24024 SI07 -74.13116 40.62351 0
## species boroname
## 1 honeylocust Queens
## 2 Callery pear Manhattan
## 3 Chinese elm Queens
## 4 cherry Brooklyn
## 5 cherry Staten Island
## 6 red maple Staten Island
# Read in the canopy layer (done previously)
# canopy <- raster("canopy.tif")
# Plot the canopy raster
plot(canopy)
# Determine the raster resolution
res(canopy)
## [1] 300 300
# Determine the number of cells
ncell(canopy)
## [1] 58190
# Aggregate the raster
canopy_small <- aggregate(canopy, fact = 10)
# Plot the new canopy layer
plot(canopy_small)
# Determine the new raster resolution
res(canopy_small)
## [1] 3000 3000
# Determine the number of cells in the new raster
ncell(canopy_small)
## [1] 598
# Plot the canopy layer to see the values above 100
plot(canopy)
# Set up the reclassification matrix: values from 100 up to 300 become NA
vals <- cbind(100, 300, NA)
# Reclassify
canopy_reclass <- reclassify(canopy, rcl = vals)
# Plot again and confirm that the legend stops at 100
plot(canopy_reclass)
# raster masks dplyr::select
detach("package:raster")
Chapter 3 - Conducting Spatial Analysis with sf and raster
Buffers and centroids:
Bounding boxes, dissolve features and create a convex hull:
Multi-layer geoprocessing and relationships:
Geoprocessing with rasters:
Example code includes:
library(raster)
##
## Attaching package: 'raster'
## The following object is masked from 'package:qdapTools':
##
## shift
## The following object is masked from 'package:qdapRegex':
##
## bind
## The following object is masked from 'package:magrittr':
##
## extract
## The following object is masked from 'package:colorspace':
##
## RGB
## The following objects are masked from 'package:spatstat':
##
## area, rotate, shift
## The following object is masked from 'package:nlme':
##
## getData
## The following object is masked from 'package:dplyr':
##
## select
# Computing buffers is a key spatial analysis skill and the resulting buffers have a wide range of uses like, for example, identifying the number of roads within one kilometer of a school
# or computing the number of hazardous waste sites near sensitive natural areas
# Although technically you can buffer data with unprojected coordinate reference systems, the buffer distance will be more meaningful with a projected CRS
# so it is highly recommended that you transform unprojected data to a projected CRS before buffering
df <- data.frame(place=c("Empire State Building", "Museum of Natural History"),
longitude=c(-73.98566, -73.97398),
latitude=c(40.74844, 40.78132),
stringsAsFactors = TRUE
)
# Review df
df
## place longitude latitude
## 1 Empire State Building -73.98566 40.74844
## 2 Museum of Natural History -73.97398 40.78132
# Convert the data frame to an sf object
df_sf <- st_as_sf(df, coords = c("longitude", "latitude"), crs=4326)
# Transform the points to match the manhattan CRS
df_crs <- st_transform(df_sf, crs = crs(manhattan, asText = TRUE))
# Buffer the points
df_buf <- st_buffer(df_crs, dist = 1000)
# Plot the manhattan image (it is multi-band)
plotRGB(manhattan)
plot(st_geometry(df_buf), col = "firebrick", add = TRUE)
plot(st_geometry(df_crs), pch = 16, add = TRUE)
# Similar to buffering, computing polygon centroids is a bedrock geoprocessing task used to assign values and even to help with labeling maps. The function for this in sf is st_centroid()
# Also similar to buffering, centroid calculations should generally be performed on data with a projected coordinate reference system
# Read in the neighborhods shapefile (done previously)
# neighborhoods <- st_read("neighborhoods.shp")
# Project neighborhoods to match manhattan
neighborhoods_tf <- st_transform(neighborhoods, crs = 32618)
# Compute the neighborhood centroids
centroids <- st_centroid(neighborhoods_tf)
# Plot the neighborhood geometry
plot(st_geometry(neighborhoods_tf), col = "grey", border = "white")
plot(centroids$geometry, pch = 16, col = "firebrick", add = TRUE)
# You can compute bounding boxes around vector data using sf
# These can help you, for example, create polygons to clip layers to a common area for an analysis or identify regions of influence
# In the sf package, the function for extracting just the bounding box coordinates, if that's all you need, is st_bbox()
# More likely you'll want to create a new sf object (a polygon) from those coordinates and to do this sf provides the st_make_grid() function
# st_make_grid() can be used to make a multi-row and multi-column grid covering your input data but it can also be used to make a grid of just one cell (a bounding box)
# To do this, you need to specify the number of grid cells as n = 1
# Use filter() to limit to European and American beech trees
beech <- trees %>% filter(species %in% c("European beech", "American beech"))
str(beech)
## Classes 'sf' and 'data.frame': 27 obs. of 8 variables:
## $ tree_id : num 182961 163271 221707 16250 183728 ...
## $ nta : Factor w/ 188 levels "BK09","BK17",..: 169 107 119 58 180 146 106 135 169 171 ...
## $ longitude : num -73.9 -74 -73.8 -73.8 -74.1 ...
## $ latitude : num 40.8 40.8 40.7 40.8 40.6 ...
## $ stump_diam: num 0 0 0 0 0 0 0 0 0 0 ...
## $ species : Factor w/ 132 levels "'Schubert' chokecherry",..: 50 50 2 50 2 2 2 50 2 2 ...
## $ boroname : Factor w/ 5 levels "Bronx","Brooklyn",..: 4 3 4 1 5 4 3 4 4 5 ...
## $ geometry :sfc_POINT of length 27; first list element: Classes 'XY', 'POINT', 'sfg' num [1:2] -73.9 40.8
## - attr(*, "sf_column")= chr "geometry"
## - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA NA NA NA
## ..- attr(*, "names")= chr "tree_id" "nta" "longitude" "latitude" ...
# Plot the neighborhoods and beech trees
plot(st_geometry(neighborhoods), col = "grey", border = "white")
plot(beech$geometry, add = TRUE, pch = 16, col = "forestgreen")
# Compute the coordinates of the bounding box
st_bbox(beech)
## xmin ymin xmax ymax
## -74.17746 40.54247 -73.70872 40.85696
# Create a bounding box polygon
beech_box <- st_make_grid(beech, n = 1)
# Plot the neighborhoods, add the beech trees and add the new box
plot(st_geometry(neighborhoods), col = "grey", border = "white")
plot(beech$geometry, add = TRUE, pch = 16, col = "forestgreen")
plot(beech_box, add = TRUE)
# In order to compute a tighter bounding box, a convex hull, around a set of points like the beech trees from the previous exercise you'll need to learn one more function first
# For points you don't want a convex hull around each individual point -- the hull of a single point is just that point, which isn't useful
# More likely you want to compute a convex hull around all your points
# If you have a set of points and you want to draw a convex hull around them you first need to bundle the points into a single MULTIPOINT feature and in order to do this you will use the dissolve function in sf called st_union()
# With polygons, st_union() will dissolve all the polygons into a single polygon representing the area where all the polygons overlap
# Your set of individual points will be dissolved/unioned into a single, MULTIPOINT feature that you can use for tasks like computing the convex hull
# Buffer the beech trees (the course uses 3000 m with projected data; this data is unprojected, so the distance below is 0.025 decimal degrees)
beech_buffer <- st_buffer(beech, 0.025)
## Warning in st_buffer.sfc(st_geometry(x), dist, nQuadSegs): st_buffer does
## not correctly buffer longitude/latitude data
## dist is assumed to be in decimal degrees (arc_degrees).
# Limit the object to just geometry
beech_buffers <- st_geometry(beech_buffer)
# Compute the number of features in beech_buffer
length(beech_buffers)
## [1] 27
# Plot the tree buffers
plot(beech_buffers)
# Dissolve the buffers
beech_buf_union <- st_union(beech_buffers)
# Compute the number of features in beech_buf_union
length(beech_buf_union)
## [1] 1
# Plot the dissolved buffers
plot(beech_buf_union)
# A more precise bounding polygon is sometimes needed, one that fits your data more neatly
# For this, you can use the st_convex_hull() function
# Note that st_convex_hull() will compute a hull around each one of your features individually so if you want to create a convex hull around a
# group of features you'll need to use st_union() to combine individual features into a single multi-feature
# Look at the data frame to see the type of geometry
head(beech)
## Simple feature collection with 6 features and 7 fields
## geometry type: POINT
## dimension: XY
## bbox: xmin: -74.12843 ymin: 40.56829 xmax: -73.71567 ymax: 40.84684
## epsg (SRID): 4326
## proj4string: +proj=longlat +ellps=WGS84 +no_defs
## tree_id nta longitude latitude stump_diam species boroname
## 1 182961 QN72 -73.90401 40.76897 0 European beech Queens
## 2 163271 MN31 -73.95473 40.77224 0 European beech Manhattan
## 3 221707 QN06 -73.79034 40.72523 0 American beech Queens
## 4 16250 BX10 -73.78911 40.84684 0 European beech Bronx
## 5 183728 SI25 -74.12843 40.56829 0 American beech Staten Island
## 6 591657 QN43 -73.71567 40.73604 0 American beech Queens
## geometry
## 1 POINT (-73.90401 40.76897)
## 2 POINT (-73.95473 40.77224)
## 3 POINT (-73.79034 40.72523)
## 4 POINT (-73.78911 40.84684)
## 5 POINT (-74.12843 40.56829)
## 6 POINT (-73.71567 40.73604)
# Convert the points to a single multi-point
beech1 <- st_union(beech)
# Look at the data frame to see the type of geometry
head(beech1)
## Geometry set for 1 feature
## geometry type: MULTIPOINT
## dimension: XY
## bbox: xmin: -74.17746 ymin: 40.54247 xmax: -73.70872 ymax: 40.85696
## epsg (SRID): 4326
## proj4string: +proj=longlat +ellps=WGS84 +no_defs
## MULTIPOINT (-74.17746 40.54247, -74.13941 40.61...
# Confirm that we went from 27 features to 1 feature (note: length() on the sf data frame beech counts its 8 columns; length() on the sfc beech1 counts features)
length(beech)
## [1] 8
length(beech1)
## [1] 1
# Compute the tight bounding box
beech_hull <- st_convex_hull(beech1)
# Plot the points together with the hull
plot(beech_hull, col = "red")
plot(beech1, add = TRUE)
# For many analysis types you need to link geographies spatially
# For example, you want to know how many trees are in each neighborhood but you don't have a neighborhood attribute in the tree data
# The best way to do this is with a spatial join using st_join()
# Importantly, the st_join() function requires sf data frames as input and will not accept an object that is just sf geometry
# You can use the st_sf() function to convert sf geometry objects to an sf data frame (st_sf() is essentially the opposite of st_geometry())
# Plot the beech on top of the neighborhoods
plot(st_geometry(neighborhoods))
plot(beech$geometry, add = TRUE, pch = 16, col = "red")
# Determine whether beech has class data.frame
class(beech)
## [1] "sf" "data.frame"
# Convert the beech geometry to a sf data frame
beech_df <- st_sf(beech)
# Confirm that beech now has the data.frame class
class(beech_df)
## [1] "sf" "data.frame"
# Join the beech trees with the neighborhoods
beech_neigh <- st_join(beech_df, neighborhoods)
## although coordinates are longitude/latitude, st_intersects assumes that they are planar
# Confirm that beech_neigh has the neighborhood information
head(beech_neigh)
## Simple feature collection with 6 features and 12 fields
## geometry type: POINT
## dimension: XY
## bbox: xmin: -74.12843 ymin: 40.56829 xmax: -73.71567 ymax: 40.84684
## epsg (SRID): 4326
## proj4string: +proj=longlat +ellps=WGS84 +no_defs
## tree_id nta longitude latitude stump_diam species boroname
## 1 182961 QN72 -73.90401 40.76897 0 European beech Queens
## 2 163271 MN31 -73.95473 40.77224 0 European beech Manhattan
## 3 221707 QN06 -73.79034 40.72523 0 American beech Queens
## 4 16250 BX10 -73.78911 40.84684 0 European beech Bronx
## 5 183728 SI25 -74.12843 40.56829 0 American beech Staten Island
## 6 591657 QN43 -73.71567 40.73604 0 American beech Queens
## county_fip boro_name ntacode ntaname
## 1 081 Queens QN72 Steinway
## 2 061 Manhattan MN31 Lenox Hill-Roosevelt Island
## 3 081 Queens QN06 Jamaica Estates-Holliswood
## 4 005 Bronx BX10 Pelham Bay-Country Club-City Island
## 5 085 Staten Island SI25 Oakwood-Oakwood Beach
## 6 081 Queens QN43 Bellerose
## boro_code geometry
## 1 4 POINT (-73.90401 40.76897)
## 2 1 POINT (-73.95473 40.77224)
## 3 4 POINT (-73.79034 40.72523)
## 4 2 POINT (-73.78911 40.84684)
## 5 5 POINT (-74.12843 40.56829)
## 6 4 POINT (-73.71567 40.73604)
# In this exercise you will determine which neighborhoods are at least partly within 2000 meters of the Empire State Building with st_intersects()
# and those that are completely within 2000 meters of the Empire State Building using st_contains()
# You will then use the st_intersection() function (notice the slight difference in function name!) to clip the neighborhoods to the buffer
# A note about the output of functions that test relationships between two sets of features
# The output of these and related functions is a special kind of list (with the class sgbp)
# For example, when using st_intersects(x, y), the first element in the output, accessed with [[1]], gives the indices of the features in y that intersect the first feature of x
# Likewise, [[2]] would give the indices of the features in y that intersect the second feature of x
# Review df
df
## place longitude latitude
## 1 Empire State Building -73.98566 40.74844
## 2 Museum of Natural History -73.97398 40.78132
df_mod <- df %>% filter(place == "Empire State Building")
df_sf_mod <- st_as_sf(df_mod, coords = c("longitude", "latitude"), crs=4326)
df_crs_mod <- st_transform(df_sf_mod, crs = crs(manhattan, asText = TRUE))
buf_mod <- st_buffer(df_crs_mod, dist = 2000)
buf <- st_transform(buf_mod, "+proj=longlat +ellps=WGS84 +no_defs")
# Identify neighborhoods that intersect with the buffer
neighborhoods_int <- st_intersects(buf, neighborhoods)
## although coordinates are longitude/latitude, st_intersects assumes that they are planar
# Identify neighborhoods contained by the buffer
neighborhoods_cont <- st_contains(buf, neighborhoods)
## although coordinates are longitude/latitude, st_contains assumes that they are planar
# Get the indexes of which neighborhoods intersect
# and are contained by the buffer
int <- neighborhoods_int[[1]]
cont <- neighborhoods_cont[[1]]
# Get the names of the neighborhoods in the buffer
neighborhoods$ntaname[int]
## [1] Clinton
## [2] Midtown-Midtown South
## [3] Turtle Bay-East Midtown
## [4] Murray Hill-Kips Bay
## [5] Gramercy
## [6] Hudson Yards-Chelsea-Flatiron-Union Square
## [7] West Village
## [8] Stuyvesant Town-Cooper Village
## [9] East Village
## 195 Levels: Airport ... Yorkville
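# Also list the neighborhoods completely contained by the buffer, for comparison
# with the intersecting set above
neighborhoods$ntaname[cont]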
# Clip the neighborhood layer by the buffer (ignore the warning)
neighborhoods_clip <- st_intersection(buf, neighborhoods)
## although coordinates are longitude/latitude, st_intersection assumes that they are planar
## Warning: attribute variables are assumed to be spatially constant
## throughout all geometries
# Plot the geometry of the clipped neighborhoods
plot(st_geometry(neighborhoods_clip), col = "red")
plot(neighborhoods[cont,]$geometry, add = TRUE, col = "yellow")
# Of course, measuring distance between feature sets is a component of spatial analysis 101 -- a core skill for any analyst
# There are several functions in base R as well as in the packages rgeos and geosphere to compute distances, but the st_distance() function from sf
# provides a useful feature-to-feature distance matrix as output and can be used for most distance calculation needs
# In this exercise you'll measure the distance from the Empire State Building to all the parks and identify the closest one
# Read in the parks object (done previously)
# parks <- st_read("parks.shp")
empire_state <- df_crs_mod
str(empire_state)
## Classes 'sf' and 'data.frame': 1 obs. of 2 variables:
## $ place : Factor w/ 2 levels "Empire State Building",..: 1
## $ geometry:sfc_POINT of length 1; first list element: Classes 'XY', 'POINT', 'sfg' num [1:2] 585632 4511327
## - attr(*, "sf_column")= chr "geometry"
## - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA
## ..- attr(*, "names")= chr "place"
# Test whether the CRS match
st_crs(empire_state) == st_crs(parks)
## [1] FALSE
# Project parks to match empire state
parks_es <- st_transform(parks, crs = st_crs(empire_state))
# Compute the distance between empire_state and parks_es
d <- st_distance(empire_state, parks_es)
# Take a quick look at the result
head(d)
## Units: m
## [1] 3055.791 19969.336 22737.903 13846.358 4604.069 18541.779
# Find the index of the nearest park
nearest <- which.min(d)
# Identify the park that is nearest
parks_es[nearest, ]
## Simple feature collection with 1 feature and 14 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: 585392 ymin: 4511320 xmax: 585418.2 ymax: 4511379
## epsg (SRID): 26918
## proj4string: +proj=utm +zone=18 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
## location communityb
## 188 Broadway, Av of the Americas, bet. W. 32 St. and W. 33 St. 105
## nys_senate signname zipcode us_congres gispropnum borough
## 188 27 Greeley Square Park 10001 12 M032 M
## waterfront nys_assemb councildis acres typecatego address
## 188 No 75 3 0.144 Triangle/Plaza 894 6 AVENUE
## geometry
## 188 MULTIPOLYGON (((585411 4511...
# Mask and crop are similar operations that allow you to limit your raster to a specific area of interest
# With mask() you essentially place your area of interest on top of the raster and any raster cells outside of the boundary are assigned NA values
# A reminder that currently the raster package does not support sf objects so they will need to be converted to Spatial objects with, for example, as(input, "Spatial").
# Project parks to match canopy
parks_cp <- st_transform(parks, crs = crs(canopy, asText = TRUE))
# Compute the area of the parks
areas <- st_area(parks_cp)
# Filter to parks with areas > 30000
parks_big <- filter(parks_cp, unclass(areas) > 30000)
# Plot the canopy raster
plot(canopy)
# Plot the geometry of parks_big
plot(st_geometry(parks_big))
# Convert parks to a Spatial object
parks_sp <- as(parks_big, "Spatial")
# Mask the canopy layer with parks_sp and save as canopy_mask
canopy_mask <- mask(canopy, mask = parks_sp)
# Plot canopy_mask -- this is a raster!
plot(canopy_mask)
# As you saw in the previous exercise with mask(), the raster extent is not changed
# If the extents of the input raster and the mask itself are different then they will still be different after running mask()
# In many cases, however, you will want your raster to share an extent with another layer and this is where crop() comes in handy
# With crop() you are cropping the raster so that the extent (the bounding box) of the raster matches the extent of the input crop layer
# But within the bounding box no masking is done (no raster cells are set to NA)
# In this exercise you will both mask and crop the NYC canopy layer based on the large parks and you'll compare
# You should notice that the masked raster includes a lot of NA values (these are the white areas) and that the extent is the same as the original canopy layer
# With the cropped layer you should notice that the extent of the cropped canopy layer matches the extent of the large parks (essentially it's zoomed in)
# Convert the parks_big layer (this is preloaded, it has been limited to large parks and projected) to a Spatial object with as() -- call this parks_sp
# Convert the parks_big to a Spatial object
parks_sp <- as(parks_big, "Spatial")
# Mask the canopy with the large parks
canopy_mask <- mask(canopy, mask = parks_sp)
# Plot the mask
plot(canopy_mask)
# Crop canopy with parks_sp
canopy_crop <- crop(canopy, parks_sp)
# Plot the cropped version and compare
plot(canopy_crop)
# Beyond simply masking and cropping you may want to know the actual cell values at locations of interest
# You might, for example, want to know the percentage canopy at your landmarks or within the large parks
# This is where the extract() function comes in handy
# Usefully, and you'll see this in a later analysis, you can feed extract() a function that will get applied to extracted cells
# For example, you can use extract() to extract raster values by neighborhood and with the fun = mean argument it will return an average cell value by neighborhood
# Similar to other raster functions, it is not yet set up to accept sf objects so you'll need to convert to a Spatial object
# Project the landmarks to match canopy
# landmarks_cp <- st_transform(landmarks, crs = crs(canopy, asText = TRUE))
# Convert the landmarks to a Spatial object
# landmarks_sp <- as(landmarks_cp, "Spatial")
# Extract the canopy values at the landmarks
# landmarks_ex <- extract(canopy, landmarks_sp)
# Look at the landmarks and extraction results
# landmarks_cp
# landmarks_ex
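# The course's landmarks layer is not available here, so as a minimal sketch the
# two points in df_sf (defined earlier in this chapter) stand in for landmarks;
# this substitution is an assumption, not the course data
landmarks_cp <- st_transform(df_sf, crs = crs(canopy, asText = TRUE))
landmarks_sp <- as(landmarks_cp, "Spatial")
landmarks_ex <- extract(canopy, landmarks_sp)
landmarks_cp
landmarks_ex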
# You will now use the canopy layer and an "imperviousness" layer from the same source, the United States Geological Survey
# Imperviousness measures whether water can pass through a surface
# So a high percentage impervious surface might be a road that does not let water pass through while a low percentage impervious might be something like a lawn
# What you will do in this exercise is essentially identify the most urban locations by finding areas that have both a low percentage of tree canopy (< 20%) and high percentage of impervious (> 80%)
# To do this, we defined the function f to do the raster math for you
# Read in the canopy (already read in) and impervious layer
# canopy <- raster("canopy.tif")
impervious <- raster("./RInputFiles/ZIP Files/impervious/impervious.tif")
# Function f with 2 arguments and the raster math code
f <- function(rast1, rast2) {
rast1 < 20 & rast2 > 80
}
# Do the overlay using f as fun
canopy_imperv_overlay <- overlay(canopy, impervious, fun = f)
# Plot the result (low tree canopy and high impervious areas)
plot(canopy_imperv_overlay)
# raster masks dplyr::select
detach("package:raster")
Chapter 4 - Combine Skills in Mini-Analysis
Compute tree density and average tree canopy by neighborhood:
First look at results with ggplot2:
Create final, polished maps with tmap:
Wrap up:
Example code includes:
library(raster)
##
## Attaching package: 'raster'
## The following object is masked from 'package:qdapTools':
##
## shift
## The following object is masked from 'package:qdapRegex':
##
## bind
## The following object is masked from 'package:magrittr':
##
## extract
## The following object is masked from 'package:colorspace':
##
## RGB
## The following objects are masked from 'package:spatstat':
##
## area, rotate, shift
## The following object is masked from 'package:nlme':
##
## getData
## The following object is masked from 'package:dplyr':
##
## select
# In order to compute tree density by neighborhood you need two things
# You will need to know the area of the neighborhoods, which you will compute in the next exercise
# And you need the tree counts by neighborhood which is the focus of this exercise
# You will produce counts of all trees by neighborhood in NYC and create a single data frame with a column for total trees
# The result should be a data frame with no geometry
# sf and dplyr are loaded in the workspace
# Compute the counts of all trees by hood (nta)
tree_counts <- count(trees, nta)
# Take a quick look
head(tree_counts)
## Simple feature collection with 6 features and 2 fields
## geometry type: MULTIPOINT
## dimension: XY
## bbox: xmin: -74.00396 ymin: 40.57265 xmax: -73.92026 ymax: 40.70249
## epsg (SRID): 4326
## proj4string: +proj=longlat +ellps=WGS84 +no_defs
## # A tibble: 6 x 3
## nta n geometry
## <fct> <int> <sf_geometry [degree]>
## 1 BK09 174 MULTIPOINT (-73.99901 40.69...
## 2 BK17 499 MULTIPOINT (-73.95982 40.58...
## 3 BK19 132 MULTIPOINT (-73.97331 40.57...
## 4 BK21 136 MULTIPOINT (-74.00396 40.58...
## 5 BK23 53 MULTIPOINT (-73.98038 40.57...
## 6 BK25 396 MULTIPOINT (-73.9718 40.607...
# Remove the geometry
tree_counts_no_geom <- st_set_geometry(tree_counts, NULL)
# Rename the n variable to tree_cnt
tree_counts_renamed <- rename(tree_counts_no_geom, tree_cnt = n)
# Create a histogram of the total counts
hist(tree_counts_renamed$tree_cnt)
# We have the tree counts (from the previous exercise)
# In this exercise you will compute neighborhood areas, add them to the neighborhood sf object and then you'll join in the non-spatial tree counts data frame from the previous exercise
# Compute areas and unclass
areas <- unclass(st_area(neighborhoods))
# Add the areas to the neighborhoods object
neighborhoods_area <- mutate(neighborhoods, area = areas)
# Join neighborhoods and counts
neighborhoods_counts <- left_join(neighborhoods_area, tree_counts_renamed, by = c("ntacode"="nta"))
## Warning: Column `ntacode`/`nta` joining factors with different levels,
## coercing to character vector
# Replace NA values with 0
neighborhoods_counts <- mutate(neighborhoods_counts,
tree_cnt = ifelse(is.na(tree_cnt),
0, tree_cnt))
# Compute the density
neighborhoods_counts <- mutate(neighborhoods_counts,
tree_density = tree_cnt/area)
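# Note: st_area() returned square meters, so tree_density above is trees per
# square meter; to report trees per sq. km instead you could scale by 1e6
# (left commented so the outputs below are unchanged; tree_density_km2 is a
# hypothetical column name, not part of the course code)
# neighborhoods_counts <- mutate(neighborhoods_counts,
#   tree_density_km2 = tree_density * 1e6)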
# In the previous exercises you computed tree density by neighborhood using tree counts
# In this exercise you will compute average tree canopy by neighborhood as a percentage so that we can compare if the results are similar
# Confirm that you have the neighborhood density results
head(neighborhoods_counts)
## Simple feature collection with 6 features and 8 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -74.00736 ymin: 40.61264 xmax: -73.77574 ymax: 40.8355
## epsg (SRID): 4326
## proj4string: +proj=longlat +ellps=WGS84 +no_defs
## county_fip boro_name ntacode ntaname boro_code area
## 1 047 Brooklyn BK88 Borough Park 3 5017229
## 2 081 Queens QN52 East Flushing 4 2736433
## 3 081 Queens QN48 Auburndale 4 3173995
## 4 081 Queens QN51 Murray Hill 4 4876380
## 5 081 Queens QN27 East Elmhurst 4 1832715
## 6 005 Bronx BX35 Morrisania-Melrose 2 1569317
## tree_cnt tree_density geometry
## 1 565 0.0001126120 MULTIPOLYGON (((-73.97605 4...
## 2 295 0.0001078046 MULTIPOLYGON (((-73.79493 4...
## 3 507 0.0001597356 MULTIPOLYGON (((-73.77574 4...
## 4 732 0.0001501114 MULTIPOLYGON (((-73.80379 4...
## 5 211 0.0001151297 MULTIPOLYGON (((-73.8611 40...
## 6 214 0.0001363650 MULTIPOLYGON (((-73.89697 4...
# Transform the neighborhoods CRS to match the canopy layer
neighborhoods_crs <- st_transform(neighborhoods_counts, crs = crs(canopy, asText = TRUE))
# Convert neighborhoods object to a Spatial object
neighborhoods_sp <- as(neighborhoods_crs, "Spatial")
# Compute the mean of canopy values by neighborhood
canopy_neighborhoods <- extract(canopy_small, neighborhoods_sp, fun = mean)
# Add the mean canopy values to neighborhoods
neighborhoods_avg_canopy <- mutate(neighborhoods_counts, avg_canopy = as.vector(canopy_neighborhoods))
# Create a histogram of tree density (tree_density)
ggplot(neighborhoods_avg_canopy, aes(x = tree_density)) +
geom_histogram(color = "white")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Create a histogram of average canopy (avg_canopy)
ggplot(neighborhoods_avg_canopy, aes(x = avg_canopy)) +
geom_histogram(color = "white")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Create a scatter plot of tree_density vs avg_canopy
ggplot(neighborhoods_avg_canopy, aes(x = tree_density, y = avg_canopy)) +
geom_point() +
stat_smooth()
## `geom_smooth()` using method = 'loess'
# Compute the correlation between density and canopy
cor(neighborhoods_avg_canopy$tree_density, neighborhoods_avg_canopy$avg_canopy)
## [1] -0.08667411
# The geom_sf() function operates like any other layer in ggplot2 where you can link variables to aesthetics on the plot through the aes() function
# In a mapping context this might mean, for example, creating a choropleth map by color coding the polygons based on a variable
# If you leave off the aesthetic mapping geom_sf() will map the geometry alone
# Note: geom_sf() is still in the development version of ggplot2 on GitHub. If you want to use geom_sf() on your machine, you need to install the dev version
# devtools::install_github("tidyverse/ggplot2")
# Plot the tree density with default colors
# ggplot(neighborhoods_avg_canopy) +
# geom_sf(aes(fill = tree_density))
# Plot the tree canopy with default colors
# ggplot(neighborhoods_avg_canopy) +
# geom_sf(aes(fill = avg_canopy))
# Plot the tree density using scale_fill_gradient()
# ggplot(neighborhoods_avg_canopy) +
# geom_sf(aes(fill = tree_density)) +
# scale_fill_gradient(low = "#edf8e9", high = "#005a32")
# Plot the tree canopy using the scale_fill_gradient()
# ggplot(neighborhoods_avg_canopy) +
# geom_sf(aes(fill = avg_canopy)) +
# scale_fill_gradient(low = "#edf8e9", high = "#005a32")
# Create a simple map of neighborhoods
library(tmap)
tm_shape(neighborhoods_avg_canopy) +
tm_polygons()
# Create a color-coded map of neighborhood tree density
tm_shape(neighborhoods_avg_canopy) +
tm_polygons(col="tree_density")
# Style the tree density map (note: tree_density is per sq. m here, so the legend title below overstates the units unless the density is rescaled)
tm_shape(neighborhoods_avg_canopy) +
tm_polygons("tree_density", palette = "Greens",
style = "quantile", n = 7,
title = "Trees per sq. KM")
# Create a similar map of average tree canopy
tm_shape(neighborhoods_avg_canopy) +
tm_polygons("avg_canopy", palette = "Greens",
style = "quantile", n = 7,
title = "Average tree canopy (%)")
# Create a map of the manhattan aerial photo
tm_shape(manhattan) +
tm_rgb()
# Create a map of the neighborhood polygons
tm_shape(neighborhoods_avg_canopy) +
tm_borders(col = "black", lwd = 0.5, alpha = 0.5)
# Combine the aerial photo and neighborhoods into one map
map1 <- tm_shape(manhattan) +
tm_rgb() +
tm_shape(neighborhoods_avg_canopy) +
tm_borders(col = "black", lwd = 0.5, alpha = 0.5)
# Create the second map of tree measures (the bbox argument caused errors, so it is omitted below)
# map2 <- tm_shape(neighborhoods_avg_canopy, bbox = bbox(manhattan)) +
map2 <- tm_shape(neighborhoods_avg_canopy) +
tm_polygons(c("tree_density", "avg_canopy"),
style = "quantile",
palette = "Greens",
title = c("Tree Density", "Average Tree Canopy"))
# Combine the two maps into one
tmap_arrange(map1, map2, asp = NA)
# raster masks dplyr::select
detach("package:raster")
Chapter 1 - Fast and Dirty - Polarity Scoring
Sentiment Analysis and Feelings:
Zipf’s Law, Number of Words, Subjectivity Lexicon:
Explore qdap - Polarity and Lexicon:
Example code includes:
# Call the libraries in a non-cached chunk
library(magrittr)
library(qdap)
## Loading required package: qdapDictionaries
## Loading required package: qdapRegex
##
## Attaching package: 'qdapRegex'
## The following object is masked from 'package:ggplot2':
##
## %+%
## The following object is masked from 'package:dplyr':
##
## explain
## Loading required package: qdapTools
##
## Attaching package: 'qdapTools'
## The following object is masked from 'package:spatstat':
##
## shift
## The following object is masked from 'package:dplyr':
##
## id
## Loading required package: RColorBrewer
##
## Attaching package: 'qdap'
## The following object is masked from 'package:magrittr':
##
## %>%
## The following object is masked from 'package:sf':
##
## %>%
## The following object is masked from 'package:dplyr':
##
## %>%
## The following object is masked from 'package:base':
##
## Filter
library(tm)
## Loading required package: NLP
##
## Attaching package: 'NLP'
## The following object is masked from 'package:qdap':
##
## ngrams
## The following object is masked from 'package:ggplot2':
##
## annotate
##
## Attaching package: 'tm'
## The following objects are masked from 'package:qdap':
##
## as.DocumentTermMatrix, as.TermDocumentMatrix
Followed by:
# We created text_df representing a conversation with person and text columns
# Use qdap's polarity() function to score text_df
# polarity() will accept a single character object or data frame with a grouping variable to calculate a positive or negative score
# In this example you will use the magrittr package's dollar pipe operator %$%
# The dollar sign forwards the data frame into polarity() and you declare the text column name, or the text column and a grouping variable, without quotes
# text_data_frame %$% polarity(text_column_name)
# To create an object with the dollar sign operator:
# polarity_object <- text_data_frame %$%
# polarity(text_column_name, grouping_column_name)
# More specifically, to make a quantitative judgement about the sentiment of some text, you need to give it a score
# A simple method is a positive or negative value related to a sentence, passage or a collection of documents called a corpus
# Scoring with positive or negative values only is called "polarity."
# A useful function for extracting polarity scores is counts() applied to the polarity object
# For a quick visual call plot() on the polarity() outcome
# From http://magrittr.tidyverse.org/
# Many functions accept a data argument, e.g. lm and aggregate, which is very useful in a pipeline where data is first processed and then passed into such a function
# There are also functions that do not have a data argument, for which it is useful to expose the variables in the data
# This is done with the %$% operator
# iris %>%
# subset(Sepal.Length > mean(Sepal.Length)) %$%
# cor(Sepal.Length, Sepal.Width)
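# The iris snippet above runs as-is; uncommented here as a quick check of the
# %$% operator (output omitted)
iris %>%
  subset(Sepal.Length > mean(Sepal.Length)) %$%
  cor(Sepal.Length, Sepal.Width)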
library(magrittr)
library(qdap)
text_df <- data.frame(
person=c('Nick', 'Jonathan', 'Martijn', 'Nicole', 'Nick', 'Jonathan', 'Martijn', 'Nicole'),
text=c('DataCamp courses are the best', 'I like talking to students', 'Other online data science curricula are boring.', 'What is for lunch?', 'DataCamp has lots of great content!', 'Students are passionate and are excited to learn', 'Other data science curriculum is hard to learn and difficult to understand', 'I think the food here is good.'),
stringsAsFactors=TRUE
)
# Examine the text data
text_df
## person
## 1 Nick
## 2 Jonathan
## 3 Martijn
## 4 Nicole
## 5 Nick
## 6 Jonathan
## 7 Martijn
## 8 Nicole
## text
## 1 DataCamp courses are the best
## 2 I like talking to students
## 3 Other online data science curricula are boring.
## 4 What is for lunch?
## 5 DataCamp has lots of great content!
## 6 Students are passionate and are excited to learn
## 7 Other data science curriculum is hard to learn and difficult to understand
## 8 I think the food here is good.
# Calc overall polarity score
text_df %$% qdap::polarity(text)
## all total.sentences total.words ave.polarity sd.polarity stan.mean.polarity
## 1 all 8 54 0.179 0.452 0.396
# Calc polarity score by person
(datacamp_conversation <- text_df %$% qdap::polarity(text, person))
## person total.sentences total.words ave.polarity sd.polarity stan.mean.polarity
## 1 Jonathan 2 13 0.577 0.184 3.141
## 2 Martijn 2 19 -0.478 0.141 -3.388
## 3 Nick 2 11 0.428 0.028 15.524
## 4 Nicole 2 11 0.189 0.267 0.707
# Counts table from datacamp_conversation
qdap::counts(datacamp_conversation)
## person wc polarity pos.words neg.words text.var
## 1 Nick 5 0.447 best - DataCamp courses are the best
## 2 Jonathan 5 0.447 like - I like talking to students
## 3 Martijn 7 -0.378 - boring Other online data science curricula are boring.
## 4 Nicole 4 0.000 - - What is for lunch?
## 5 Nick 6 0.408 great - DataCamp has lots of great content!
## 6 Jonathan 8 0.707 passionate, excited - Students are passionate and are excited to learn
## 7 Martijn 12 -0.577 - hard, difficult Other data science curriculum is hard to learn and difficult to understand
## 8 Nicole 7 0.378 good - I think the food here is good.
# Plot the conversation polarity
plot(datacamp_conversation)
## Warning: `show_guide` has been deprecated. Please use `show.legend`
## instead.
## Warning: `show_guide` has been deprecated. Please use `show.legend`
## instead.
# In the Text Mining: Bag of Words course you learned that a corpus is a set of texts, and you studied some functions for preprocessing the text
# To recap, one way to create a corpus is with the functions below
# Even though this is a different course, sentiment analysis is part of text mining so a refresher can be helpful
# Turn a character vector into a text source using VectorSource().
# Turn a text source into a corpus using VCorpus().
# Remove unwanted characters from the corpus using cleaning functions like removePunctuation() and stripWhitespace() from tm, and replace_abbreviation() from qdap
# In this exercise a custom clean_corpus() function has been created using standard preprocessing functions for easier application
# clean_corpus() accepts the output of VCorpus() and applies cleaning functions. For example:
# processed_corpus <- clean_corpus(my_corpus)
library(tm)
clean_corpus <- function(corpus){
corpus <- tm_map(corpus, content_transformer(replace_abbreviation))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, c(stopwords("en"), "coffee"))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, stripWhitespace)
return(corpus)
}
# Your R session has a text vector, tm_define, containing two small documents and the function clean_corpus().
tm_define <- c("Text mining is the process of distilling actionable insights from text.", "Sentiment analysis represents the set of tools to extract an author's feelings towards a subject.")
# clean_corpus(), tm_define are pre-defined
clean_corpus
## function(corpus){
## corpus <- tm_map(corpus, content_transformer(replace_abbreviation))
## corpus <- tm_map(corpus, removePunctuation)
## corpus <- tm_map(corpus, removeNumbers)
## corpus <- tm_map(corpus, removeWords, c(stopwords("en"), "coffee"))
## corpus <- tm_map(corpus, content_transformer(tolower))
## corpus <- tm_map(corpus, stripWhitespace)
## return(corpus)
## }
tm_define
## [1] "Text mining is the process of distilling actionable insights from text."
## [2] "Sentiment analysis represents the set of tools to extract an author's feelings towards a subject."
# Create a VectorSource
tm_vector <- VectorSource(tm_define)
# Apply VCorpus
tm_corpus <- VCorpus(tm_vector)
# Examine the first document's contents
content(tm_corpus[[1]])
## [1] "Text mining is the process of distilling actionable insights from text."
# Clean the text
tm_clean <- clean_corpus(tm_corpus)
# Reexamine the contents of the first doc
content(tm_clean[[1]])
## [1] "text mining process distilling actionable insights text"
# Now let's create a Document Term Matrix (DTM). In a DTM
# Each row of the matrix represents a document.
# Each column is a unique word token.
# Values of the matrix correspond to an individual document's word usage
# The DTM is the basis for many bag of words analyses
# Later in the course, you will also use the related Term Document Matrix (TDM)
# This is the transpose; that is, columns represent documents and rows represent unique word tokens
# You should construct a DTM after cleaning the corpus (using clean_corpus())
# To do so, call DocumentTermMatrix() on the corpus object
# tm_dtm <- DocumentTermMatrix(tm_clean)
# If you need a more in-depth refresher check out the Text Mining: Bag of Words course
# Hopefully these two exercises have prepared you well enough to embark on your sentiment analysis journey!
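# Since tm_clean already exists above, the DTM can be built right away; a
# quick sketch (showing only the dimensions to keep the output small):
tm_dtm <- DocumentTermMatrix(tm_clean)
dim(tm_dtm)  # 2 documents by the number of unique tokens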
# We've created a VCorpus() object called clean_text containing 1000 tweets mentioning coffee
# The tweets have been cleaned with the previously mentioned preprocessing steps and your goal is to create a DTM from it
# clean_text is pre-defined in the course (the VCorpus is not available here)
# clean_text
# Create tf_dtm
# tf_dtm <- DocumentTermMatrix(clean_text)
# Create tf_dtm_m
# tf_dtm_m <- as.matrix(tf_dtm)
# Dimensions of DTM matrix
# dim(tf_dtm_m)
# Subset part of tf_dtm_m for comparison
# tf_dtm_m[16:20, 2975:2985]
# Although Zipf observed a steep and predictable decline in word usage, you may not buy into Zipf's law
# You may be thinking "I know plenty of words, and have a distinctive vocabulary"
# That may be the case, but the same can't be said for most people!
# To prove it, let's construct a visual from 3 million tweets mentioning "#sb"
# Keep in mind that the visual doesn't follow Zipf's law perfectly; the tweets all mention the same hashtag, so the distribution is a bit skewed
# That said, the visual you will make follows a steep decline showing a small lexical diversity among the millions of tweets
# So there is some science behind using lexicons for natural language analysis!
# In this exercise, you will use the package metricsgraphics
# Although the author suggests using the pipe %>% operator, you will construct the graphic step-by-step to learn about the various aspects of the plot
# The main function of the package metricsgraphics is the mjs_plot() function which is the first step in creating a JavaScript plot
# Once you have that, you can add other layers on top of the plot
# An example metricsgraphics workflow without using the %>% operator is below
# metro_plot <- mjs_plot(data, x = x_axis_name, y = y_axis_name, show_rollover_text = FALSE)
# metro_plot <- mjs_line(metro_plot)
# metro_plot <- mjs_add_line(metro_plot, line_one_values)
# metro_plot <- mjs_add_legend(metro_plot, legend = c('names', 'more_names'))
# metro_plot
rawWords <- c('sb', 'rt', 'the', 'to', 'a', 'for', 'esurancesweepstakes', 'you', 'broncos', 'esurance', 'in', 'is', 'of', 'on', 'win', 'and', 'panthers', 'nfl', 'i', 'super', 'at', 'with', 'bowl', 'this', 'your', 'superbowl', 'it', 'are', 'keeppounding', 'that', 'be', 'will', 'k', 'game', 'amp', 'we', 'our', 'my', 'got', 'cam', 'https…', 'big', 'if', 'but', 'from', 'just', 'time', 'now', 'all', 'have', 'up', 'who', 'out', 'show', 'sbfanvote', 'peyton', 'so', 'chance', 'was', 'why', 'watch', 'see', 'like', 'winning', 'not', 'commercial', 'get', 'by', 'coldplay', 'more', 'think', 'what', 'go', 'one', 'do', 'over', 'here', 'halftime', 'away', 'good', 'me', 'gaga', 'lady', 'i’ve', 'how', 'httpstco…', 'ready', 'manning', 'pepsihalftime', 'could', 'wearing', 'ad', 'during', 'its', 'about', 'beyonce', 'doritos', 'httpst…', 'an', 'day', 'going', 'anthem', 'after', 'national', 'than', 'team', 'want', 'gonna', 'some', 'his', 'denver', 'best', 'ladygaga', 'can', 'im', 'pass', 'today', 'enter', 'socks', 'avosinspace', 'shoes', 'sandals', 'reporter', 'jeans', 'biggame', 'httpstcoqdraydnsb', 'brunomars', 'tomorrow', 'sweepstakes', 'check', 'when', 'avosfrommexico', 'beyoncé', 'sunday', 'great', 'seo', 'mvp', 'performance', 'as', 'love', 'they', 'new', 'field', 'did', 'congrats', 'tmobile', 'still', 'no', 'drake', 'tonight', 'special', 'yougotcarriered', 'he', 'last', 'has', 'too', 'superbowlsunday', 'lets', 'make')
rawFreq <- c(1984423, 1700564, 1101899, 588803, 428598, 388390, 326464, 322154, 296673, 292468, 266847, 265392, 245718, 234509, 233618, 233157, 215919, 212620, 202765, 183808, 182673, 176209, 175996, 172636, 146487, 143345, 142812, 136649, 134436, 130056, 128878, 126930, 116187, 115213, 114805, 108680, 103023, 88247, 88099, 84442, 82291, 82116, 79843, 78986, 77616, 77562, 75405, 73245, 70581, 68565, 68325, 66217, 66030, 64489, 63026, 62986, 62878, 61111, 60982, 59658, 59629, 57424, 56911, 56585, 56455, 56182, 55496, 55237, 54729, 53962, 52840, 50489, 46303, 46216, 45832, 45569, 44364, 43338, 42667, 42008, 41743, 41566, 41473, 40003, 39888, 39808, 39421, 38575, 38498, 37085, 35345, 32997, 31292, 31018, 30832, 29258, 29183, 28980, 28908, 27361, 27283, 23367, 23183, 22575, 22456, 21964, 21095, 20530, 20213, 19514, 19428, 19115, 18887, 18483, 18120, 16901, 14239, 14110, 13475, 13424, 13329, 13326, 13304, 13231, 13221, 13194, 12641, 12225, 11635, 11502, 11362, 11341, 11293, 11102, 10986, 10660, 10637, 10331, 10136, 10040, 9963, 9745, 9616, 9495, 9468, 9397, 9384, 9368, 9284, 8914, 8732, 8719, 8697, 8629, 8536, 8379, 8316, 7977, 7970)
rawRank <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159)
sb_words <- data.frame(word=rawWords, freq=rawFreq, rank=rawRank)
# Examine sb_words
head(sb_words)
## word freq rank
## 1 sb 1984423 1
## 2 rt 1700564 2
## 3 the 1101899 3
## 4 to 588803 4
## 5 a 428598 5
## 6 for 388390 6
# Create expectations
# The course code computes freq / rank:
# sb_words$expectations <- sb_words %$%
# {freq / rank}
# But Zipf's law predicts frequency proportional to 1 / rank, so the expected
# frequency at a given rank is the top frequency divided by rank, freq[1] / rank:
sb_words$expectations <- sb_words %$%
{freq[1] / rank}
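# Quick check: under Zipf's law the expected frequency at rank r is roughly
# freq[1] / r, so expectations should track freq closely for the top-ranked
# words (output omitted)
head(sb_words[, c("word", "freq", "rank", "expectations")])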
# Create metrics plot
sb_plot <- metricsgraphics::mjs_plot(sb_words, x = rank, y = freq, show_rollover_text = FALSE)
# Add 1st line
sb_plot <- metricsgraphics::mjs_line(sb_plot)
# Add 2nd line
sb_plot <- metricsgraphics::mjs_add_line(sb_plot, expectations)
# Add legend
sb_plot <- metricsgraphics::mjs_add_legend(sb_plot, legend = c("Frequency", "Expectation"))
# Display plot
sb_plot
# So far you have learned the basic components needed for assessing positive or negative intent in text
# Remember the following points so you can feel confident in your results.
# The subjectivity lexicon is a predefined list of words associated with emotions or positive/negative feelings.
# You don't have to list every word in a subjectivity lexicon because Zipf's law describes human expression.
# A quick way to get started is to use the polarity() function which has a built-in subjectivity lexicon
# The function scans the text to identify words in the lexicon
# It then creates a word group around the identified positive or negative subjectivity word
# Within the group valence shifters adjust the score
# Valence shifters are words that amplify or negate the emotional intent of the subjectivity word
# For example, "well known" is positive while "not well known" is negative
# Here "not" is a negating term and reverses the emotional intent of "well known."
# In contrast, "very well known" employs an amplifier increasing the positive intent
# The polarity() function then calculates a score using subjectivity terms, valence shifters and the total number of words in the passage
# This exercise demonstrates a simple polarity calculation
# In the next video we look under the hood of polarity() for more detail
# We've defined positive to denote a positive statement
# Example statement
positive <- "DataCamp courses are good for learning"
# Calculate the polarity of the statement
(pos_score <- polarity(positive))
## all total.sentences total.words ave.polarity sd.polarity stan.mean.polarity
## 1 all 1 6 0.408 NA NA
# Get counts
(pos_counts <- counts(pos_score))
## all wc polarity pos.words neg.words text.var
## 1 all 6 0.408 good - DataCamp courses are good for learning
# Number of positive words
n_good <- length(pos_counts$pos.words[[1]])
# Total number of words
n_words <- pos_counts$wc
# Verify polarity score
n_good / sqrt(n_words)
## [1] 0.4082483
# Of course just positive and negative words aren't enough
# In this exercise you will learn about valence shifters which tell you about the author's emotional intent
# Previously you applied polarity() to text without valence shifters. In this example you will see amplifiers and negating words in action
# Recall that an amplifying word adds 0.8 to a positive word in polarity() so the positive score becomes 1.8
# For negative words 0.8 is subtracted so the total becomes -1.8
# Then the score is divided by the square root of the total number of words
# Consider the following example from Frank Sinatra: "It was a very good year"
# "Good" equals 1 and "very" adds another 0.8
# So, 1.8/sqrt(6) results in 0.73 polarity
# A negating word such as "not" will inverse the subjectivity score
# Consider the following example from Bobby McFerrin: "Don't worry Be Happy"
# "worry is now 1 due to the negation "don't."
# Adding the "happy", +1, equals 2
# With 4 total words, 2 / sqrt(4) equals a polarity score of 1
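# The two calculations above, verified by hand in plain R (no qdap needed):
1.8 / sqrt(6)  # "It was a very good year": 1 for "good" plus 0.8 for "very"
## [1] 0.7348469
2 / sqrt(4)  # "Don't worry Be Happy": negated "worry" (+1) plus "happy" (+1)
## [1] 1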
conversation <- data.frame(student=c('Martijn', 'Nick', 'Nicole'),
text=c('This restaurant is never bad', 'The lunch was very good', 'It was awful I got food poisoning and was extremely ill'),
stringsAsFactors=TRUE
)
# Examine conversation
conversation
## student text
## 1 Martijn This restaurant is never bad
## 2 Nick The lunch was very good
## 3 Nicole It was awful I got food poisoning and was extremely ill
# Polarity - All
polarity(conversation$text)
## all total.sentences total.words ave.polarity sd.polarity stan.mean.polarity
## 1 all 3 21 0.317 0.565 0.561
# Polarity - Grouped
student_pol <- conversation %$%
polarity(text, student)
# Student results
scores(student_pol)
## student total.sentences total.words ave.polarity sd.polarity stan.mean.polarity
## 1 Martijn 1 5 0.447 NA NA
## 2 Nick 1 5 0.805 NA NA
## 3 Nicole 1 11 -0.302 NA NA
# Sentence by sentence
counts(student_pol)
## student wc polarity pos.words neg.words text.var
## 1 Martijn 5 0.447 - bad This restaurant is never bad
## 2 Nick 5 0.805 good - The lunch was very good
## 3 Nicole 11 -0.302 - awful It was awful I got food poisoning and was extremely ill
# qdap plot
plot(student_pol)
## Warning: `show_guide` has been deprecated. Please use `show.legend`
## instead.
## Warning: `show_guide` has been deprecated. Please use `show.legend`
## instead.
## Warning: Removed 3 rows containing missing values (geom_errorbarh).
# Even with Zipf's law in action, you will still need to adjust lexicons to fit the text source (for example twitter versus legal documents) or the author's demographics (teenage girl versus middle aged man)
# This exercise demonstrates the explicit components of polarity() so you can change it if needed
# In Trey Songz "Lol :)" song there is a lyric "LOL smiley face, LOL smiley face."
# In the basic polarity() function, "LOL" is not defined as positive
# However, "LOL" stands for "Laugh Out Loud" and should be positive
# As a result, you should adjust the lexicon to fit the text's context which includes pop-culture slang
# If your analysis contains text from a specific channel (Twitter's "LOL"), location (Boston's "Wicked Good"), or age group (teenagers "sick") you will likely have to adjust the lexicon
# In this exercise you are not adjusting the subjectivity lexicon or qdap dictionaries containing valence shifters
# Instead you are examining the existing word data frame objects so you can change them in the following exercise
# We've created text containing two excerpts from Beyoncé's "Crazy in Love" lyrics for the exercise
# Examine the key.pol
key.pol
## x y
## 1: a plus 1
## 2: abnormal -1
## 3: abolish -1
## 4: abominable -1
## 5: abominably -1
## ---
## 6775: zealously -1
## 6776: zenith 1
## 6777: zest 1
## 6778: zippy 1
## 6779: zombie -1
# Negators
negation.words
## [1] "ain't" "aren't" "can't" "couldn't" "didn't"
## [6] "doesn't" "don't" "hasn't" "isn't" "mightn't"
## [11] "mustn't" "neither" "never" "no" "nobody"
## [16] "nor" "not" "shan't" "shouldn't" "wasn't"
## [21] "weren't" "won't" "wouldn't"
# Amplifiers
amplification.words
## [1] "acute" "acutely" "certain" "certainly"
## [5] "colossal" "colossally" "deep" "deeply"
## [9] "definite" "definitely" "enormous" "enormously"
## [13] "extreme" "extremely" "great" "greatly"
## [17] "heavily" "heavy" "high" "highly"
## [21] "huge" "hugely" "immense" "immensely"
## [25] "incalculable" "incalculably" "massive" "massively"
## [29] "more" "particular" "particularly" "purpose"
## [33] "purposely" "quite" "real" "really"
## [37] "serious" "seriously" "severe" "severely"
## [41] "significant" "significantly" "sure" "surely"
## [45] "true" "truly" "vast" "vastly"
## [49] "very"
# De-amplifiers
deamplification.words
## [1] "barely" "faintly" "few" "hardly"
## [5] "little" "only" "rarely" "seldom"
## [9] "slightly" "sparesly" "sporadically" "very few"
## [13] "very little"
text <- data.frame(speaker=c("beyonce", "jay_z"),
words=c("I know I dont understand Just how your love can do what no one else can", "They cant figure him out they like hey, is he insane"),
stringsAsFactors=TRUE
)
# Examine
text
## speaker
## 1 beyonce
## 2 jay_z
## words
## 1 I know I dont understand Just how your love can do what no one else can
## 2 They cant figure him out they like hey, is he insane
# Explicit polarity parameters
polarity(
text.var = text$words,
grouping.var = text$speaker,
polarity.frame = key.pol,
negators = negation.words,
amplifiers = amplification.words,
deamplifiers = deamplification.words
)
## speaker total.sentences total.words ave.polarity sd.polarity stan.mean.polarity
## 1 beyonce 1 16 0.25 NA NA
## 2 jay_z 1 11 0.00 NA NA
# Here you will adjust the negative words to account for the specific text. You will then compare the basic and custom polarity() scores
# A popular song from Twenty One Pilots is called "Stressed Out".
# If you scan the lyrics of this song, you will observe the song is about youthful nostalgia
# Overall, most people would say the polarity is negative
# Repeatedly the lyrics mention stress, fears and pretending
# Let's compare the song lyrics using the default subjectivity lexicon and also a custom one
# To start, you need to verify the key.pol subjectivity lexicon does not already have the term you want to add
# One way to check is with grep()
# The grep() function returns the indices of rows containing characters that match a search pattern
# data_frame[grep("search_pattern", data_frame$column), ]
# After verifying the slang or new word is not already in the key.pol lexicon you need to add it
# The code below uses sentiment_frame() to construct the new lexicon
# Within the code sentiment_frame() accepts the original positive word vector, positive.words
# Next, the original negative.words are concatenated to "smh" and "kappa", both considered negative slang
# Although you can declare the positive and negative weights, the default is 1 and -1 so they are not included below
# custom_pol <- sentiment_frame(positive.words, c(negative.words, "smh", "kappa"))
stressed_out <- "I wish I found some better sounds no ones ever heard\nI wish I had a better voice that sang some better words\nI wish I found some chords in an order that is new\nI wish I didnt have to rhyme every time I sang\nI was told when I get older all my fears would shrink\nBut now Im insecure and I care what people think\nMy names Blurryface and I care what you think\nMy names Blurryface and I care what you think\nWish we could turn back time, to the good old days\nWhen our momma sang us to sleep but now were stressed out\nWish we could turn back time to the good old days\nWhen our momma sang us to sleep but now were stressed out\nWere stressed out\nSometimes a certain smell will take me back to when I was young\nHow come Im never able to identify where its coming from\nId make a candle out of it if I ever found it\nTry to sell it never sell out of it Id probably only sell one\nItd be to my brother, cause we have the same nose\nSame clothes homegrown a stones throw from a creek we used to roam\nBut it would remind us of when nothing really mattered\nOut of student loans and tree-house homes we all would take the latter\nMy names Blurryface and I care what you think\nMy names Blurryface and I care what you think\nWish we could turn back time, to the good old days\nWhen our momma sang us to sleep but now were stressed out\nWish we could turn back time, to the good old days\nWhen our momma sang us to sleep but now were stressed out\nWe used to play pretend, give each other different names\nWe would build a rocket ship and then wed fly it far away\nUsed to dream of outer space but now theyre laughing at our face #\nSaying, Wake up you need to make money\nYeah\nWe used to play pretend give each other different names\nWe would build a rocket ship and then wed fly it far away\nUsed to dream of outer space but now theyre laughing at our face\nSaying, Wake up, you need to make money\nYeah\nWish we could turn back time, to the good old days\nWhen our momma sang us to sleep but now were stressed out\nWish we could turn back time, to the good old days\nWhen our momma sang us to sleep but now were stressed out\nUsed to play pretend, used to play pretend bunny\nWe used to play pretend wake up, you need the money\nUsed to play pretend used to play pretend bunny\nWe used to play pretend, wake up, you need the money\nWe used to play pretend give each other different names\nWe would build a rocket ship and then wed fly it far away\nUsed to dream of outer space but now theyre laughing at our face\nSaying, Wake up, you need to make money\nYeah"
# stressed_out has been pre-defined
head(stressed_out)
## [1] "I wish I found some better sounds no ones ever heard\nI wish I had a better voice that sang some better words\nI wish I found some chords in an order that is new\nI wish I didnt have to rhyme every time I sang\nI was told when I get older all my fears would shrink\nBut now Im insecure and I care what people think\nMy names Blurryface and I care what you think\nMy names Blurryface and I care what you think\nWish we could turn back time, to the good old days\nWhen our momma sang us to sleep but now were stressed out\nWish we could turn back time to the good old days\nWhen our momma sang us to sleep but now were stressed out\nWere stressed out\nSometimes a certain smell will take me back to when I was young\nHow come Im never able to identify where its coming from\nId make a candle out of it if I ever found it\nTry to sell it never sell out of it Id probably only sell one\nItd be to my brother, cause we have the same nose\nSame clothes homegrown a stones throw from a creek we used to roam\nBut it would remind us of when nothing really mattered\nOut of student loans and tree-house homes we all would take the latter\nMy names Blurryface and I care what you think\nMy names Blurryface and I care what you think\nWish we could turn back time, to the good old days\nWhen our momma sang us to sleep but now were stressed out\nWish we could turn back time, to the good old days\nWhen our momma sang us to sleep but now were stressed out\nWe used to play pretend, give each other different names\nWe would build a rocket ship and then wed fly it far away\nUsed to dream of outer space but now theyre laughing at our face #\nSaying, Wake up you need to make money\nYeah\nWe used to play pretend give each other different names\nWe would build a rocket ship and then wed fly it far away\nUsed to dream of outer space but now theyre laughing at our face\nSaying, Wake up, you need to make money\nYeah\nWish we could turn back time, to the good old days\nWhen our momma sang us to sleep but now were stressed out\nWish we could turn back time, to the good old days\nWhen our momma sang us to sleep but now were stressed out\nUsed to play pretend, used to play pretend bunny\nWe used to play pretend wake up, you need the money\nUsed to play pretend used to play pretend bunny\nWe used to play pretend, wake up, you need the money\nWe used to play pretend give each other different names\nWe would build a rocket ship and then wed fly it far away\nUsed to dream of outer space but now theyre laughing at our face\nSaying, Wake up, you need to make money\nYeah"
# Basic lexicon score
polarity(stressed_out)
## all total.sentences total.words ave.polarity sd.polarity stan.mean.polarity
## 1 all 1 518 -0.255 NA NA
# Check the subjectivity lexicon
key.pol[grep("stress", x)]
## x y
## 1: distress -1
## 2: distressed -1
## 3: distressing -1
## 4: distressingly -1
## 5: mistress -1
## 6: stress -1
## 7: stresses -1
## 8: stressful -1
## 9: stressfully -1
# New lexicon
custom_pol <- sentiment_frame(positive.words, c(negative.words, "stressed", "turn back"))
# Compare new score
polarity(stressed_out, polarity.frame = custom_pol)
## all total.sentences total.words ave.polarity sd.polarity stan.mean.polarity
## 1 all 1 518 -0.826 NA NA
Chapter 2 - Sentiment Analysis the tidytext Way
Plutchik’s wheel of emotion, polarity vs. sentiment:
Bing lexicon with inner join:
AFINN and NRC methodologies in more detail:
Example code includes:
# There is a growing number of "tidyverse" R packages
# The tidyverse is a collection of R packages that share common philosophies and are designed to work together
# This chapter covers some tidy functions to manipulate data
# In fact, in this exercise you will compare a DTM to a tidy text data frame called a tibble
# Within the tidyverse each observation is a single row in a data frame
# That makes working in different packages much easier since the fundamental data structure is the same
# Parts of this course borrow heavily from the tidytext package which uses this data organization
# To change a DTM to a tidy format use tidy() from the broom package.
# tidy_format <- tidy(Document_Term_Matrix)
# This exercise uses text from the Greek tragedy, Agamemnon
# Agamemnon is a story about marital infidelity and murder
# You can download a copy here (http://www.gutenberg.org/ebooks/14417?msg=welcome_stranger)
# We've already created a clean DTM called ag_dtm for this exercise.
clean_corpus <- function(corpus){
corpus <- tm_map(corpus, content_transformer(replace_abbreviation))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, c(stopwords("en")))
return(corpus)
}
agRawText <- readLines("./RInputFiles/pg14417.txt")
agSource <- VectorSource(agRawText[318:3430])
agCorpus <- VCorpus(agSource)
agClean <- clean_corpus(agCorpus)
ag_dtm <- DocumentTermMatrix(agClean)
# As matrix
ag_dtm_m <- as.matrix(ag_dtm)
# Examine row 2206, columns 308:313 (the course used columns 245:250)
ag_dtm_m[2206, 308:313]
## bleed bleeds blent bless blessã¨d blessing
## 0 0 0 1 0 0
# Tidy up the DTM (broom::tidy() does not work on the DTM in this setup, so build the tibble manually)
# ag_tidy <- broom::tidy(ag_dtm)
ag_tidy <- tibble::tibble(document=ag_dtm$dimnames$Docs[ag_dtm$i],
term=ag_dtm$dimnames$Terms[ag_dtm$j],
count=ag_dtm$v
)
# Examine tidy with a word you saw
ag_tidy[824:828, ]
## # A tibble: 5 x 3
## document term count
## <chr> <chr> <dbl>
## 1 234 bleeds 1.00
## 2 234 sleepeth 1.00
## 3 235 comes 1.00
## 4 235 will 1.00
## 5 235 wisdom 1.00
# So far you have used a single lexicon
# Now we will transition to using three, each measuring sentiment in different ways
# The tidytext package contains a data frame called sentiments
# The data frame contains over 23000 terms from three different subjectivity lexicons with corresponding information
# The course prints some example rows from the sentiments data frame (not reproduced here)
# Notice the tidy format
# Each word is a row and NAs fill in columns that are not applicable
# The "AFINN" lexicon scores words from 5 to -5
# The "Bing" lexicon is the same lexicon used in qdap's polarity() function
# "Bing" words are only labeled as positive or negative
# The "NRC" lexicon has distinct emotional classes covering Plutchik's Wheel and positive and negative
# Subset to AFINN
afinn_lex <- tidytext::get_sentiments("afinn")
# Count AFINN scores
afinn_lex %>%
count(score)
## # A tibble: 11 x 2
## score n
## <int> <int>
## 1 -5 16
## 2 -4 43
## 3 -3 264
## 4 -2 965
## 5 -1 309
## 6 0 1
## 7 1 208
## 8 2 448
## 9 3 172
## 10 4 45
## 11 5 5
# Subset to nrc
nrc_lex <- tidytext::get_sentiments("nrc")
# Print nrc_lex
nrc_lex
## # A tibble: 13,901 x 2
## word sentiment
## <chr> <chr>
## 1 abacus trust
## 2 abandon fear
## 3 abandon negative
## 4 abandon sadness
## 5 abandoned anger
## 6 abandoned fear
## 7 abandoned negative
## 8 abandoned sadness
## 9 abandonment anger
## 10 abandonment fear
## # ... with 13,891 more rows
# Make the nrc counts object
nrc_counts <- nrc_lex %>%
count(sentiment)
# Barplot
ggplot(nrc_counts, aes(x = sentiment, y = n))+
geom_bar(stat = "identity") +
ggthemes::theme_gdocs()
# The Bing lexicon labels words as positive or negative
# The next three exercises let you interact with this specific lexicon
# Instead of using filter() to extract a lexicon this exercise uses get_sentiments() which accepts a string such as "afinn", "bing", "nrc", or "loughran"
# Now that you understand the basics of an inner join, let's apply this to the "Bing" lexicon
# Keep in mind the inner_join() function comes from dplyr and the sentiments object is from tidytext
# The inner join workflow:
# Obtain the correct lexicon using either filter() or get_sentiments().
# Pass the lexicon and the tidy text data to inner_join().
# In order for inner_join() to work there must be a shared column name. If there are no shared column names, declare them with an additional parameter, by equal to c with column names like below
# object <- x %>%
# inner_join(y, by = c("column_from_x" = "column_from_y"))
# We've loaded ag_txt containing the first 100 lines from Agamemnon and ag_tidy which is the tidy version
ag_txt <- agRawText[agRawText != ""][1:100]
# Qdap polarity
polarity(ag_txt)
## Warning in polarity(ag_txt):
## Some rows contain double punctuation. Suggested use of `sentSplit` function.
## all total.sentences total.words ave.polarity sd.polarity stan.mean.polarity
## 1 all 100 1038 -0.093 0.364 -0.255
# Get Bing lexicon
bing <- tidytext::get_sentiments("bing")
# Join text to lexicon
ag_bing_words <- inner_join(ag_tidy, bing, by = c("term" = "word"))
# Examine
ag_bing_words
## # A tibble: 1,641 x 4
## document term count sentiment
## <chr> <chr> <dbl> <chr>
## 1 10 waste 1.00 negative
## 2 11 respite 1.00 positive
## 3 13 well 1.00 positive
## 4 14 lonely 1.00 negative
## 5 16 great 1.00 positive
## 6 16 heavenly 1.00 positive
## 7 22 dark 1.00 negative
## 8 23 fear 1.00 negative
## 9 24 warning 1.00 negative
## 10 25 well 1.00 positive
## # ... with 1,631 more rows
# Get counts by sentiment
ag_bing_words %>%
count(sentiment)
## # A tibble: 2 x 2
## sentiment n
## <chr> <int>
## 1 negative 1033
## 2 positive 608
# The spread() function spreads a key-value pair across multiple columns
# In this case the key is the sentiment and the values are the frequency of positive or negative terms for each line
# Using spread() changes the data so that each row now has positive and negative values, even if it is 0
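# A minimal spread() illustration on a hypothetical toy data frame:
toy <- data.frame(index = c(1, 1, 2),
  sentiment = c("negative", "positive", "positive"),
  n = c(2, 1, 3))
tidyr::spread(toy, sentiment, n, fill = 0)
##   index negative positive
## 1     1        2        1
## 2     2        0        3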
# In this exercise, your R session has m_dick_tidy, which contains the book Moby Dick, and bing, the lexicon from the previous exercise
all_books <- readRDS("./RInputFiles/all_books.rds")
m_dick_tidy <- all_books[all_books$book=="moby_dick", c("term", "document", "count")]
m_dick_tidy
## # A tibble: 109,996 x 3
## term document count
## <chr> <chr> <dbl>
## 1 chapter 2 1.00
## 2 loomings 2 1.00
## 3 agonever 5 1.00
## 4 call 5 1.00
## 5 ishmael 5 1.00
## 6 long 5 1.00
## 7 mind 5 1.00
## 8 preciselyhaving 5 1.00
## 9 some 5 1.00
## 10 years 5 1.00
## # ... with 109,986 more rows
# Inner join
moby_lex_words <- inner_join(m_dick_tidy, bing, by = c("term" = "word"))
moby_lex_words <- moby_lex_words %>%
# Set index to numeric document
mutate(index = as.numeric(document))
moby_count <- moby_lex_words %>%
# Count by sentiment, index
count(sentiment, index)
# Examine the counts
moby_count
## # A tibble: 10,594 x 3
## sentiment index n
## <chr> <dbl> <int>
## 1 negative 9.00 1
## 2 negative 11.0 1
## 3 negative 22.0 1
## 4 negative 41.0 1
## 5 negative 42.0 2
## 6 negative 44.0 1
## 7 negative 56.0 1
## 8 negative 64.0 1
## 9 negative 66.0 1
## 10 negative 68.0 1
## # ... with 10,584 more rows
moby_spread <- moby_count %>%
# Spread sentiments
tidyr::spread(sentiment, n, fill = 0)
# Review the spread data
moby_spread
## # A tibble: 9,229 x 3
## index negative positive
## <dbl> <dbl> <dbl>
## 1 9.00 1.00 0
## 2 11.0 1.00 0
## 3 13.0 0 1.00
## 4 17.0 0 1.00
## 5 19.0 0 1.00
## 6 22.0 1.00 0
## 7 24.0 0 1.00
## 8 25.0 0 1.00
## 9 31.0 0 2.00
## 10 35.0 0 2.00
## # ... with 9,219 more rows
# The last Bing lexicon exercise!
# We started with this lexicon since it's similar to the results in Chapter 1
# In this exercise you will use the pipe operator (%>%) to create a timeline of the sentiment in Moby Dick
# In the end you will also create a simple visual following the code structure below
# The next chapter goes into more depth for visuals
# Your R session has moby as your text and bing as your lexicon
# After this exercise you should be able to answer: is Moby Dick a happy or a sad book?
moby_polarity <- m_dick_tidy %>%
mutate(index = as.numeric(document)) %>%
# Inner join to lexicon
inner_join(bing, by = c("term" = "word")) %>%
# Count the sentiment scores
count(sentiment, index) %>%
# Spread the sentiment into positive and negative columns
tidyr::spread(sentiment, n, fill = 0) %>%
# Add polarity column
mutate(polarity = positive - negative)
# Plot polarity vs. index
ggplot(moby_polarity, aes(x=index, y=polarity)) +
# Add a smooth trend curve
geom_smooth()
## `geom_smooth()` using method = 'gam'
# Now we transition to the AFINN lexicon
# The AFINN lexicon has numeric values from -5 to 5, not just positive or negative
# Unlike the Bing lexicon's sentiment, the AFINN lexicon's sentiment score column is called score
# As before, you apply inner_join() then count()
# Next, to sum the scores of each line, we use dplyr's group_by() and summarize() functions
# The group_by() function takes an existing data frame and converts it into a grouped data frame where operations are performed "by group"
# Then, the summarize() function lets you calculate a value for each group in your data frame using a function that aggregates data, like sum() or mean()
# So, in our case we can do something like
# data_frame %>%
# group_by(book_line) %>%
# summarize(total_score = sum(score))
# In the tidy version of Huckleberry Finn, line 9703 contains the words "best", "ever", "fun", "life" and "spirit"; "best" and "fun" have AFINN scores of 3 and 4 respectively
# After aggregating, line 9703 will have a total score of 7 (the other words are not in the AFINN lexicon)
# The afinn object contains the AFINN lexicon
# The huck object is a tidy version of Mark Twain's Adventures of Huckleberry Finn for analysis
# Line 5400 is "All the loafers looked glad; I reckoned they was used to having fun out of Boggs"
# Stopwords and punctuation have already been removed in the dataset
huck <- all_books[all_books$book=="huck_finn", c("term", "document", "count")] %>%
mutate(document=as.numeric(document)) %>%
rename(line=document)
huck
## # A tibble: 55,198 x 3
## term line count
## <chr> <dbl> <dbl>
## 1 finn 1.00 1.00
## 2 ïhuckleberry 1.00 1.00
## 3 ago 3.00 1.00
## 4 fifty 3.00 1.00
## 5 forty 3.00 1.00
## 6 mississippi 3.00 1.00
## 7 scene 3.00 1.00
## 8 the 3.00 1.00
## 9 time 3.00 1.00
## 10 valley 3.00 1.00
## # ... with 55,188 more rows
# See abbreviated line 5400
huck %>% filter(line == 5400)
## # A tibble: 7 x 3
## term line count
## <chr> <dbl> <dbl>
## 1 all 5400 1.00
## 2 fun 5400 1.00
## 3 glad 5400 1.00
## 4 loafers 5400 1.00
## 5 looked 5400 1.00
## 6 reckoned 5400 1.00
## 7 used 5400 1.00
# What are the scores of the sentiment words?
afinn_lex %>% filter(word %in% c("fun", "glad"))
## # A tibble: 2 x 2
## word score
## <chr> <int>
## 1 fun 4
## 2 glad 3
huck_afinn <- huck %>%
# Inner Join to AFINN lexicon
inner_join(afinn_lex, by = c("term" = "word")) %>%
# Count by score and line
count(score, line)
huck_afinn_agg <- huck_afinn %>%
# Group by line
group_by(line) %>%
# Sum scores by line
summarize(total_score = sum(score))
# Filter huck_afinn_agg
huck_afinn_agg %>% filter(line == 5400)
## # A tibble: 1 x 2
## line total_score
## <dbl> <int>
## 1 5400 7
# Plot total score vs. line
ggplot(huck_afinn_agg, aes(x=line, y=total_score)) +
# Add a smooth trend curve
geom_smooth()
## `geom_smooth()` using method = 'gam'
# Last but not least, you get to work with the NRC lexicon which labels words across multiple emotional states
# Remember Plutchik's wheel of emotion? The NRC lexicon tags words according to Plutchik's 8 emotions plus positive/negative
# In this exercise there is a new operator, %in%, which matches a vector to another
# In the code below %in% will return FALSE, FALSE, TRUE
# This is because within some_vec, 1 and 2 are not found within some_other_vector but 3 is found and returns TRUE
# The %in% is useful to find matches
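# The snippet this refers to is missing here; a minimal reconstruction with
# hypothetical vectors chosen to reproduce FALSE, FALSE, TRUE:
some_vec <- c(1, 2, 3)
some_other_vector <- c(3, 4, 5)
some_vec %in% some_other_vector
## [1] FALSE FALSE  TRUE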
# We've created oz which is the tidy version of The Wizard of Oz along with nrc containing the "NRC" lexicon with renamed columns
# Switched to Julius Caesar since it is readily available in the dataset
jc <- all_books[all_books$book=="julius_caesar", c("term", "document", "count")] %>%
mutate(document=as.numeric(document)) %>%
rename(line=document)
jc
## # A tibble: 13,165 x 3
## term line count
## <chr> <dbl> <dbl>
## 1 etext 1.00 1.00
## 2 file 1.00 1.00
## 3 gutenberg 1.00 1.00
## 4 ïthis 1.00 1.00
## 5 presented 1.00 1.00
## 6 project 1.00 1.00
## 7 cooperation 2.00 1.00
## 8 inc 2.00 1.00
## 9 library 2.00 2.00
## 10 world 2.00 1.00
## # ... with 13,155 more rows
# Join text and lexicon
jc_nrc <- inner_join(jc, nrc_lex, by = c("term" = "word"))
# DataFrame of tally
jc_plutchik <- jc_nrc %>%
# Only consider Plutchik sentiments
filter(!sentiment %in% c("positive", "negative")) %>%
# Group by sentiment
group_by(sentiment) %>%
# Get total count by sentiment
summarize(total_count = sum(count))
# Plot the counts
ggplot(jc_plutchik, aes(x = sentiment, y = total_count)) +
# Add a column geom
geom_col()
Chapter 3 - Visualizing Sentiment
Parlor trick or worthwhile?
Introduction using sentiment analysis:
Interpreting visualizations:
Example code includes:
# Sometimes you want to track sentiment over time
# For example, during an ad campaign you could track brand sentiment to see the campaign's effect
# You saw a few examples of this at the end of the last chapter
# In this exercise you'll recap the workflow for exploring sentiment over time using the novel Moby Dick
# One should expect that happy moments in the book would have more positive words than negative
# Conversely dark moments and sad endings should use more negative language
# You'll also see some tricks to make your sentiment time series more visually appealing
moby_polarity <- m_dick_tidy %>%
mutate(index = as.numeric(document)) %>%
# Inner join to the lexicon
inner_join(bing, by=c("term" = "word")) %>%
# Count by sentiment, index
count(sentiment, index) %>%
# Spread sentiments
tidyr::spread(sentiment, n, fill=0) %>%
mutate(
# Add polarity field
polarity = positive - negative,
# Add line number field
line_number = row_number()
)
# Plot
ggplot(moby_polarity, aes(x=line_number, y=polarity)) +
geom_smooth() +
geom_hline(yintercept = 0, color = "red") +
ggtitle("Moby Dick Chronological Polarity") +
ggthemes::theme_gdocs()
## `geom_smooth()` using method = 'gam'
# One of the easiest ways to explore data is with a frequency analysis
# Although not difficult, in sentiment analysis this simple method can be surprisingly illuminating
# Specifically, you will build a barplot. In this exercise you are once again working with the Moby Dick tidy data (m_dick_tidy) and bing to construct your visual
# Inner join without renamed columns
moby_sents <- inner_join(m_dick_tidy, bing, by = c("term" = "word"))
# Tidy sentiment calculation
moby_tidy_sentiment <- moby_sents %>%
count(term, sentiment, wt = count) %>%
tidyr::spread(sentiment, n, fill = 0) %>%
mutate(polarity = positive - negative)
# Review
moby_tidy_sentiment
## # A tibble: 2,362 x 4
## term negative positive polarity
## <chr> <dbl> <dbl> <dbl>
## 1 abominable 3.00 0 -3.00
## 2 abominate 1.00 0 -1.00
## 3 abomination 1.00 0 -1.00
## 4 abound 0 3.00 3.00
## 5 abruptly 2.00 0 -2.00
## 6 absence 5.00 0 -5.00
## 7 absurd 3.00 0 -3.00
## 8 absurdly 1.00 0 -1.00
## 9 abundance 0 3.00 3.00
## 10 abundant 0 2.00 2.00
## # ... with 2,352 more rows
# Subset
moby_tidy_small <- moby_tidy_sentiment %>%
filter(abs(polarity) >= 50)
# Add polarity
moby_tidy_pol <- moby_tidy_small %>%
mutate(
pol = ifelse(polarity > 0, "positive", "negative")
)
# Plot
ggplot(
moby_tidy_pol,
aes(reorder(term, polarity), polarity, fill = pol)
) +
geom_bar(stat = "identity") +
ggtitle("Moby Dick: Sentiment Word Frequency") +
ggthemes::theme_gdocs() +
theme(axis.text.x = element_text(angle = 90, vjust = -0.1))
# Now that you have seen how polarity can be used to divide a corpus, let's do it!
# This code will walk you through dividing a corpus based on sentiment so you can peer into the information in subsets instead of holistically
# Your R session has oz_pol which was created by applying polarity() to "The Wonderful Wizard of Oz."
# For simplicity's sake, we created a simple custom function called pol_subsections() which will divide the corpus by polarity score
# First, the function accepts a data frame with each row being a sentence or document of the corpus
# The data frame is subset wherever the polarity values are greater than or less than 0
# Next, the positive sentences and the negative sentences (those with non-zero polarity) are each pasted together with the collapse parameter, grouping the terms into one document per polarity
# Lastly, the two documents are concatenated into a single vector of two distinct documents
pol_subsections <- function(df) {
x.pos <- subset(df$text, df$polarity > 0)
x.neg <- subset(df$text, df$polarity < 0)
x.pos <- paste(x.pos, collapse = " ")
x.neg <- paste(x.neg, collapse = " ")
all.terms <- c(x.pos, x.neg)
return(all.terms)
}
# At this point you have omitted the neutral sentences and want to focus on organizing the remaining text
# In this exercise we use the %>% operator again to forward objects to functions
# After some simple cleaning use comparison.cloud() to make the visual
# Using Agamemnon instead since it is readily available
ag_pol <- polarity(agRawText[318:3430])
## Warning in polarity(agRawText[318:3430]):
## Some rows contain double punctuation. Suggested use of `sentSplit` function.
# Add scores to each document line in a data frame
ag_df <- ag_pol$all %>%
select(text = text.var, polarity = polarity)
# Custom function
all_terms <- pol_subsections(ag_df)
# Make a corpus
all_corpus <- all_terms %>%
VectorSource() %>%
VCorpus()
# Basic TDM
all_tdm <- TermDocumentMatrix(
all_corpus,
control = list(
removePunctuation = TRUE,
stopwords = stopwords(kind = "en")
)
) %>%
as.matrix() %>%
set_colnames(c("positive", "negative"))
# Make a comparison cloud
wordcloud::comparison.cloud(
all_tdm,
max.words = 50,
colors = c("darkgreen", "darkred")
)
# In this exercise you go beyond subsetting on positive and negative language
# Instead you will subset text by each of the 8 emotions in Plutchik's emotional wheel to construct a visual
# With this approach you will get more clarity in word usage by mapping to a specific emotion instead of just positive or negative.
# Using the tidytext subjectivity lexicon, "nrc", you perform an inner_join() with your text
# The "nrc" lexicon has the 8 emotions plus positive and negative term classes
# So you will have to drop positive and negative words after performing your inner_join()
# One way to do so is with the negation, !, and grepl()
# The "Global Regular Expression Print Logical," grepl(), function will return a True or False if a string pattern is identified in each row
# In this exercise you will search for positive OR negative using the | operator, representing "or" as shown below
# Often this straight line is above the enter key on a keyboard
# Since the ! negation precedes grepl(), the T or F is switched so the "positive|negative" is dropped instead of kept
# Next you apply count() on the identified words along with spread() to get the data frame organized
# This exercise introduces rownames()
# This function declares the names of rows in a data frame
# It behaves a bit differently from most functions: the object gaining the row names goes inside rownames() on the left side of <-
# On the right side you declare the character vector of names, for example data_frame[, 1]. For instance:
# rownames(data_frame) <- vector_of_names
# After setting row names you will create a more varied comparison.cloud()
# NOTE - appears NRC is already converted to have 'term' rather than 'word'
# Inner join
moby_sentiment <- inner_join(m_dick_tidy, nrc_lex, by = c("term" = "word"))
# Drop positive or negative
moby_pos_neg <- moby_sentiment %>%
filter(!grepl("positive|negative", sentiment))
# Count terms by sentiment then spread
moby_tidy <- moby_pos_neg %>%
count(sentiment, term = term) %>%
tidyr::spread(sentiment, n, fill = 0) %>%
as.data.frame()
# Set row names
rownames(moby_tidy) <- moby_tidy[, 1]
# Drop terms column
moby_tidy[, 1] <- NULL
# Examine
head(moby_tidy)
## anger anticipation disgust fear joy sadness surprise trust
## abandon 0 0 0 3 0 3 0 0
## abandoned 7 0 0 7 0 7 0 0
## abandonment 2 0 0 2 0 2 2 0
## abhorrent 1 0 1 1 0 0 0 0
## abominable 0 0 3 3 0 0 0 0
## abomination 1 0 1 1 0 0 0 0
# Comparison cloud
wordcloud::comparison.cloud(moby_tidy, max.words = 50, title.size = 1.5)
# Another way to slice your text is to understand how much of the document(s) are made of positive or negative words
# For example a restaurant review may have some positive aspects such as "the food was good" but then continue to add "the restaurant was dirty, the staff was rude and parking was awful."
# As a result, you may want to understand how much of a document is dedicated to positive vs negative language
# In this example it would have a higher negative percentage compared to positive
# One method for doing so is to count() the positive and negative words then divide by the number of subjectivity words identified
# In the restaurant review example, "good" would count as 1 positive and "dirty," "rude," and "awful" count as 3 negative terms
# A simple calculation would lead you to believe the restaurant review is 25% positive and 75% negative since there were 4 subjectivity terms
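# A quick back-of-the-envelope version of that calculation (toy counts, not real review data):
toy_terms <- c("positive", "negative", "negative", "negative")
prop.table(table(toy_terms)) * 100
# returns negative 75, positive 25, matching the split described above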
# Start by performing the inner_join() on a unified tidy data frame of multiple books (the course example used Agamemnon, Oz, Huck Finn, and Moby Dick; the all_books object here contains additional titles)
# Just like the previous exercise you will use filter() and grepl()
# To perform the count() you have to group the data by book and then sentiment
# For example all the positive words for Agamemnon have to be grouped then tallied so that positive words from all books are not mixed
# Luckily, you can pass multiple variables into count() directly
# Forward book_sents, which is the NRC inner join to all tidy books, to filter()
# Review tail of all_books
tail(all_books)
## # A tibble: 6 x 5
## term document count author book
## <chr> <chr> <dbl> <chr> <chr>
## 1 ebooks 19117 1.00 twain innocents_abroad
## 2 email 19117 1.00 twain innocents_abroad
## 3 hear 19117 1.00 twain innocents_abroad
## 4 new 19117 1.00 twain innocents_abroad
## 5 newsletter 19117 1.00 twain innocents_abroad
## 6 subscribe 19117 1.00 twain innocents_abroad
# Inner join
books_sents <- inner_join(all_books, nrc_lex, by=c("term"="word"))
# Keep only positive or negative
books_pos_neg <- books_sents %>%
filter(grepl("positive|negative", sentiment))
# Review tail again
tail(books_pos_neg)
## # A tibble: 6 x 6
## term document count author book sentiment
## <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 included 19106 1.00 twain innocents_abroad positive
## 2 compliance 19107 1.00 twain innocents_abroad positive
## 3 main 19110 1.00 twain innocents_abroad positive
## 4 information 19114 1.00 twain innocents_abroad positive
## 5 including 19115 1.00 twain innocents_abroad positive
## 6 foundation 19116 1.00 twain innocents_abroad positive
# Count by book & sentiment
books_sent_count <- books_pos_neg %>%
count(book, sentiment)
# Review entire object
books_sent_count
## # A tibble: 22 x 3
## book sentiment n
## <chr> <chr> <int>
## 1 bartleby negative 537
## 2 bartleby positive 864
## 3 confidence_man negative 3561
## 4 confidence_man positive 5899
## 5 ct_yankee negative 4048
## 6 ct_yankee positive 6154
## 7 hamlet negative 1677
## 8 hamlet positive 2250
## 9 huck_finn negative 2471
## 10 huck_finn positive 3544
## # ... with 12 more rows
# Split, make proportional
book_pos <- books_sent_count %>%
group_by(book) %>%
mutate(percent_positive = n / sum(n) * 100)
# Proportional bar plot
ggplot(book_pos, aes(x = book, y = percent_positive, fill = sentiment)) +
geom_bar(stat = "identity")
# We've loaded ag_tidy as a tidy version of Agamemnon and afinn_lex as a subset of the tidytext "afinn" lexicon
# Agamemnon inner join
ag_afinn <- inner_join(ag_tidy, afinn_lex, by=c("term"="word")) %>%
mutate(line=as.numeric(document)) %>%
select(-document)
# Add book
ag_afinn$book <- "agamemnon"
# Oz inner join (use jc instead)
jc_afinn <- inner_join(jc, afinn_lex, by=c("term"="word"))
# Add book
jc_afinn$book <- "jc"
# Combine
all_df <- rbind(ag_afinn, jc_afinn)
# Plot 2 densities
ggplot(all_df, aes(x = score, fill = book)) +
geom_density(alpha = 0.3) +
ggthemes::theme_gdocs() +
ggtitle("AFINN Score Densities")
# In this exercise the all_book_polarity object is already loaded
# The data frame contains two columns, book and polarity
# It comprises all books with qdap's polarity() function applied
all_book_polarity <- readRDS("./RInputFiles/all_book_polarity.rds")
# Examine
str(all_book_polarity)
## 'data.frame': 14437 obs. of 2 variables:
## $ book : Factor w/ 4 levels "huck","agamemnon",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ polarity: num 0.277 0.258 -0.577 0.25 0.516 ...
# Summary of polarity by book
tapply(all_book_polarity$polarity, all_book_polarity$book, FUN=summary)
## $huck
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.38700 -0.25820 0.23570 0.04156 0.26730 1.60400
##
## $agamemnon
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.4670 -0.3780 -0.3333 -0.1266 0.3333 1.2250
##
## $moby
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.13300 -0.28870 -0.25000 -0.02524 0.28870 1.84800
##
## $oz
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.2730 -0.2774 0.2582 0.0454 0.2887 1.1880
# Box plot
ggplot(all_book_polarity, aes(x = book, y = polarity)) +
geom_boxplot(fill = c("#bada55", "#F00B42", "#F001ED", "#BA6E15"), col = "darkred") +
geom_jitter(position = position_jitter(width = 0.1, height = 0), alpha = 0.02) +
ggthemes::theme_gdocs() +
ggtitle("Book Polarity")
# Remember Plutchik's wheel of emotion?
# The NRC lexicon has the 8 emotions corresponding to the first ring of the wheel
# Previously you created a comparison.cloud() according to the 8 primary emotions
# Now you will create a radar chart similar to the wheel in this exercise
# A radar chart is a two-dimensional representation of multidimensional data (at least three variables)
# In this case the tally of the different emotions for a book are represented in the chart
# Using a radar chart, you can review all 8 emotions simultaneously
# As before, the "nrc" lexicon is available as nrc_lex, and moby_huck (constructed below) is a combined tidy version of both Moby Dick and Huck Finn
bindMoby <- m_dick_tidy %>%
mutate(document=as.numeric(document), book="moby")
bindHuck <- huck %>%
mutate(book="huck") %>%
rename(document=line)
moby_huck <- rbind(bindMoby, bindHuck)
moby_huck
## # A tibble: 165,194 x 4
## term document count book
## * <chr> <dbl> <dbl> <chr>
## 1 chapter 2.00 1.00 moby
## 2 loomings 2.00 1.00 moby
## 3 agonever 5.00 1.00 moby
## 4 call 5.00 1.00 moby
## 5 ishmael 5.00 1.00 moby
## 6 long 5.00 1.00 moby
## 7 mind 5.00 1.00 moby
## 8 preciselyhaving 5.00 1.00 moby
## 9 some 5.00 1.00 moby
## 10 years 5.00 1.00 moby
## # ... with 165,184 more rows
# Review tail of moby_huck
tail(moby_huck)
## # A tibble: 6 x 4
## term document count book
## <chr> <dbl> <dbl> <chr>
## 1 subscribe 11788 1.00 huck
## 2 ebooks 11789 1.00 huck
## 3 email 11789 1.00 huck
## 4 hear 11789 1.00 huck
## 5 new 11789 1.00 huck
## 6 newsletter 11789 1.00 huck
# Inner join
books_sents <- inner_join(moby_huck, nrc_lex, by=c("term"="word"))
# Drop positive or negative
books_pos_neg <- books_sents %>%
filter(!grepl("positive|negative", sentiment))
# Tidy tally
books_tally <- books_pos_neg %>%
group_by(book, sentiment) %>%
tally()
# Key value pairs
scores <- books_tally %>%
tidyr::spread(book, n)
# Review scores
scores
## # A tibble: 8 x 3
## sentiment huck moby
## <chr> <int> <int>
## 1 anger 1123 2811
## 2 anticipation 2214 4740
## 3 disgust 823 2025
## 4 fear 1332 4178
## 5 joy 1713 3175
## 6 sadness 1303 3393
## 7 surprise 1154 2153
## 8 trust 2191 5099
# JavaScript radar chart
radarchart::chartJSRadar(scores)
# Make the scores relative to the total
scoresRelative <- scores %>%
mutate(huckRel = huck/sum(huck), mobyRel=moby/sum(moby))
scoresRelative
## # A tibble: 8 x 5
## sentiment huck moby huckRel mobyRel
## <chr> <int> <int> <dbl> <dbl>
## 1 anger 1123 2811 0.0947 0.102
## 2 anticipation 2214 4740 0.187 0.172
## 3 disgust 823 2025 0.0694 0.0734
## 4 fear 1332 4178 0.112 0.152
## 5 joy 1713 3175 0.145 0.115
## 6 sadness 1303 3393 0.110 0.123
## 7 surprise 1154 2153 0.0974 0.0781
## 8 trust 2191 5099 0.185 0.185
# JavaScript radar chart
radarchart::chartJSRadar(scoresRelative[, c("sentiment", "huckRel", "mobyRel")])
# Often you will find yourself working with documents in groups, such as by author, product, or company
# This exercise lets you learn about the text while retaining the groups in a compact visual
# For example, with customer reviews grouped by product you may want to explore multiple dimensions of the customer reviews at the same time
# First you could calculate the polarity() of the reviews. Another dimension may be length
# Document length can also indicate emotional intensity
# If a customer leaves a short "great shoes!" one could infer they are actually less enthusiastic compared to a lengthier positive review
# You may also want to group reviews by product type such as women's, men's and children's shoes. A treemap lets you examine all of these dimensions
# For text analysis, within a treemap each individual box represents a document such as a tweet
# Documents are grouped in some manner such as author
# The size of each box is determined by a numeric value such as number of words or letters
# The individual colors are determined by a sentiment score
# After you organize the tibble, you use the treemap library containing the function treemap() to make the visual
# The code example below declares the data, grouping variables, size, color and other aesthetics
# treemap(data_frame,
# index = c("group", "individual_document"),
# vSize = "V1",
# vColor = "avg_score",
# type = "value",
# title = "Book Sentiment Scores",
# palette = c("red", "white", "green"))
# The pre-loaded all_books object contains a combined tidy format corpus with 4 Shakespeare, 3 Melville and 4 Twain books
# Based on the treemap you should be able to tell who writes longer books, and the polarity of the author as a whole and for individual books
books_score <- all_books %>%
# Inner join with AFINN scores
inner_join(afinn_lex, by=c("term" = "word"))
book_length <- books_score %>%
# Count number of words per book
count(book)
book_score <- books_score %>%
# Group by author, book
group_by(author, book) %>%
# Calculate mean book score
summarize(mean_score = mean(score))
book_tree <- book_score %>%
# Inner join by book
inner_join(book_length, by=c("book"))
# Examine the results
book_tree
## # A tibble: 11 x 4
## # Groups: author [?]
## author book mean_score n
## <chr> <chr> <dbl> <int>
## 1 melville bartleby 0.101 761
## 2 melville confidence_man 0.506 5480
## 3 melville moby_dick 0.161 8973
## 4 shakespeare hamlet 0.0984 2064
## 5 shakespeare julius_caesar 0.0846 1359
## 6 shakespeare macbeth 0.222 910
## 7 shakespeare romeo_juliet 0.175 1978
## 8 twain ct_yankee 0.199 6083
## 9 twain huck_finn 0.0763 4849
## 10 twain innocents_abroad 0.405 8988
## 11 twain tom_sawyer -0.0265 3741
# Make the visual
treemap::treemap(book_tree,
index = c("author", "book"),
vSize = "n",
vColor = "mean_score",
type = "value",
title = "Book Sentiment Scores",
palette = c("red", "white", "green")
)
Chapter 4 - Case Study: Airbnb
Refresher on text mining workflow:
Organize and clean the text:
Feature extraction and analysis:
Draw conclusions:
Next steps:
Example code includes:
# The Boston property rental reviews are stored in a CSV file located by the predefined variable bos_reviews_file
# bos_reviews_file has been pre-defined
# bos_reviews_file
# load raw text
# bos_reviews <- read.csv(bos_reviews_file, stringsAsFactors = FALSE)
bos_reviews <- readRDS("./RInputFiles/bos_reviews.rds")
# Structure
str(bos_reviews)
## 'data.frame': 1000 obs. of 2 variables:
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ comments: chr "My daughter and I had a wonderful stay with Maura. She kept in close touch with us throughout the day as we weren't arriving ti"| __truncated__ "We stay at Elizabeth's place for 3 nights in October 2014.\nThe apartment is really a great place to stay. \nLovely decorated a"| __truncated__ "If you're staying in South Boston, this is a terrific place to camp out. The apartment and bedroom are lovely, Ellie is an exce"| __truncated__ "Derian and Brian were great and prompt with their communications with us. The room was as described; it was a small nice and cl"| __truncated__ ...
# Dimensions
dim(bos_reviews)
## [1] 1000 2
# Using a kernel density plot you should notice the reviews do not center on 0. Often there are two causes for this sentiment "grade inflation."
# First, social norms may lead respondents to be pleasant instead of neutral
# This, of course, is channel specific
# Particularly snarky channels like e-sports or social media posts may skew negative leading to "deflation."
# These channels have different expectations
# A second possible reason could be "feature based sentiment".
# In some reviews an author may write "the bed was comfortable and nice but the kitchen was dirty and gross."
# The sentiment of this type of review encompasses multiple features simultaneously and therefore could make an average score skewed
# In a subsequent exercise you will adjust this "grade inflation" but here explore the reviews without any change
# We've also loaded a larger polarity object for all 1000 comments
# This new object is called bos_pol
# Now apply summary() to the correct list element that returns all polarity scores of bos_pol
# Practice applying polarity to the first 6 reviews
practice_pol <- polarity(bos_reviews$comments[1:6])
## Warning in polarity(bos_reviews$comments[1:6]):
## Some rows contain double punctuation. Suggested use of `sentSplit` function.
# Review the object
practice_pol
## all total.sentences total.words ave.polarity sd.polarity stan.mean.polarity
## 1 all 6 390 0.747 0.398 1.875
# Check out the practice polarity
summary(practice_pol$all$polarity)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2500 0.5009 0.6594 0.7466 1.0780 1.2460
# Summary for all reviews
bos_pol <- polarity(bos_reviews$comments)
## Warning in polarity(bos_reviews$comments):
## Some rows contain double punctuation. Suggested use of `sentSplit` function.
summary(bos_pol$all$polarity)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -0.9712 0.6047 0.8921 0.9022 1.2060 3.7510 1
# Plot it
ggplot(bos_pol$all, aes(x = polarity, y = ..density..)) +
ggthemes::theme_gdocs() +
geom_histogram(binwidth = 0.25, fill = "#bada55", colour = "grey60") +
geom_density(size = 0.75)
## Warning: Removed 1 rows containing non-finite values (stat_bin).
## Warning: Removed 1 rows containing non-finite values (stat_density).
# In this exercise you will perform Step 3 of the text mining workflow
# Although qdap isn't a tidy package you will mutate() a new column based on the returned polarity list representing all polarity (that's a hint BTW) scores
# In chapter 3 we used a custom function pol_subsections which uses only base R declarations
# However, following tidy principles, this exercise uses filter() and then introduces pull()
# The pull() function works like [[ to extract a single variable
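# As a rough equivalence sketch (a toy example on a built-in dataset, not the reviews):
# mtcars %>% filter(cyl == 4) %>% pull(mpg)
# returns the same vector as mtcars[mtcars$cyl == 4, ][["mpg"]]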
# Once segregated you collapse all the positive and negative comments into two larger documents representing all words among the positive and negative rental reviews
# Lastly, you will create a Term Frequency Inverse Document Frequency (TFIDF) weighted Term Document Matrix (TDM)
# Since this exercise code starts with a tidy structure, some of the functions borrowed from tm are used along with the %>% operator to keep the style consistent
# If the basics of the tm package aren't familiar check out the Text Mining: Bag of Words course
# Instead of counting the number of times a word is used (frequency), the values in the TDM are penalized for overused terms, which helps reduce non-informative words
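# As a rough sketch of that weighting (my summary; tm's weightTfIdf with normalization,
# if I recall it correctly, uses tf = count / document length and idf = log2(nDocs / docs containing the term)):
tf <- 3 / 10        # a term used 3 times in a 10-word document
idf <- log2(2 / 1)  # the term appears in only 1 of the 2 documents
tf * idf            # 0.3; a term present in both documents would get idf = log2(2/2) = 0 and drop out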
# Review
bos_pol$group
## all total.sentences total.words ave.polarity sd.polarity
## 1 all 1000 70481 0.9021735 0.5015318
## stan.mean.polarity
## 1 1.798836
# Add polarity column
bos_reviews_with_pol <- bos_reviews %>%
mutate(polarity = bos_pol$all$polarity)
# Subset positive comments
pos_comments <- bos_reviews_with_pol %>%
filter(polarity > 0) %>%
pull(comments)
# Subset negative comments
neg_comments <- bos_reviews_with_pol %>%
filter(polarity < 0) %>%
pull(comments)
# Paste and collapse the positive comments
pos_terms <- paste(pos_comments, collapse = " ")
# Paste and collapse the negative comments
neg_terms <- paste(neg_comments, collapse = " ")
# Concatenate the terms
all_terms <- c(pos_terms, neg_terms)
# Pipe a VectorSource Corpus
all_corpus <- all_terms %>%
VectorSource() %>%
VCorpus()
# Simple TFIDF TDM
all_tdm <- TermDocumentMatrix(
all_corpus,
control = list(
weighting = weightTfIdf,
removePunctuation = TRUE,
stopwords = stopwords(kind = "en")
)
)
# Examine the TDM
all_tdm
## <<TermDocumentMatrix (terms: 4967, documents: 2)>>
## Non-/sparse entries: 4350/5584
## Sparsity : 56%
## Maximal term length: 93
## Weighting : term frequency - inverse document frequency (normalized) (tf-idf)
# Previously you learned that applying tidy() on a TermDocumentMatrix() object will convert the TDM to a tibble
# In this exercise you will create the word data directly from the review column called comments
# First you use unnest_tokens() to make the text lowercase and tokenize the reviews into single words
# Sometimes it is useful to capture the original word order within each group of a corpus
# To do so, use mutate(). In mutate() you will use seq_along() to create a sequence of numbers from 1 to the length of the object
# This will capture the word order as it was written
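# A tiny illustration (not from the exercise): seq_along() simply numbers the elements
seq_along(c("my", "daughter", "and"))
# returns 1 2 3, so within each grouped review the words are numbered in their original order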
# In the tm package, you would use removeWords() to remove stopwords
# In the tidyverse you first need to load the stop words lexicon and then apply an anti_join() between the tidy text data frame and the stopwords
# Load the premade "SMART" stopwords to your R session with data("stop_words")
# Vector to tibble
tidy_reviews <- bos_reviews %>%
tidytext::unnest_tokens(word, comments)
# Group by and mutate
tidy_reviews <- tidy_reviews %>%
group_by(id) %>%
mutate(original_word_order = seq_along(word))
# Quick review
tidy_reviews
## # A tibble: 70,986 x 3
## # Groups: id [1,000]
## id word original_word_order
## <int> <chr> <int>
## 1 1 my 1
## 2 1 daughter 2
## 3 1 and 3
## 4 1 i 4
## 5 1 had 5
## 6 1 a 6
## 7 1 wonderful 7
## 8 1 stay 8
## 9 1 with 9
## 10 1 maura 10
## # ... with 70,976 more rows
# Load stopwords
data("stop_words", package="tidytext")
# Perform anti-join
tidy_reviews_without_stopwords <- tidy_reviews %>%
anti_join(stop_words)
## Joining, by = "word"
# Here you will learn that differing sentiment methods will cause different results
# Often you will simply need to have results align directionally although the specifics may be different
# In the last exercise you created tidy_reviews which is a data frame of rental reviews without stopwords
# Earlier in the chapter, you calculated and plotted qdap's basic polarity() function
# This showed you the reviews tend to be positive
# Now let's perform a similar analysis the tidytext way!
# Recall from an earlier chapter you will perform an inner_join() followed by count() and then a spread()
# Lastly, you will create a new column using mutate() and passing in positive - negative.
# Get the correct lexicon
bing <- tidytext::get_sentiments("bing")
# Calculate polarity for each review
pos_neg <- tidy_reviews_without_stopwords %>%
inner_join(bing, by=c("word")) %>%
count(sentiment) %>%
tidyr::spread(sentiment, n, fill = 0) %>%
mutate(polarity = positive - negative)
# Check outcome
summary(pos_neg)
## id negative positive polarity
## Min. : 1.0 Min. : 0.0000 Min. : 0.000 Min. :-11.000
## 1st Qu.: 253.0 1st Qu.: 0.0000 1st Qu.: 3.000 1st Qu.: 2.000
## Median : 498.0 Median : 0.0000 Median : 4.000 Median : 4.000
## Mean : 500.4 Mean : 0.6128 Mean : 4.965 Mean : 4.353
## 3rd Qu.: 748.0 3rd Qu.: 1.0000 3rd Qu.: 7.000 3rd Qu.: 6.000
## Max. :1000.0 Max. :14.0000 Max. :28.000 Max. : 26.000
# Often authors will use more words when they are more passionate
# For example, an angry airline passenger will leave a longer review the worse the (perceived) service was
# Conversely a less impassioned passenger may not feel compelled to spend a lot of time writing a review
# Lengthy reviews may inflate overall sentiment since the reviews will inherently contain more positive or negative language as the review lengthens
# This coding exercise helps to examine effort and sentiment
# In this exercise you will visualize the relationship between effort and sentiment
# Recall your rental review tibble contains an id and that a word is represented in each row
# As a result a simple count() of the id will capture the number of words used in each review
# Then you will join this summary to the positive and negative data
# Ultimately you will create a scatter plot that will visualize author review length and its relationship to polarity
# tidy_reviews and pos_neg from the previous exercises are available in your workspace
# Review tidy_reviews
tidy_reviews_without_stopwords
## # A tibble: 26,247 x 3
## # Groups: id [?]
## id word original_word_order
## <int> <chr> <int>
## 1 1 daughter 2
## 2 1 wonderful 7
## 3 1 stay 8
## 4 1 maura 10
## 5 1 close 14
## 6 1 touch 15
## 7 1 day 20
## 8 1 arriving 24
## 9 1 til 25
## 10 1 evening 29
## # ... with 26,237 more rows
# Review pos_neg
pos_neg
## # A tibble: 953 x 4
## # Groups: id [953]
## id negative positive polarity
## <int> <dbl> <dbl> <dbl>
## 1 1 0 4.00 4.00
## 2 2 0 3.00 3.00
## 3 3 0 3.00 3.00
## 4 4 0 6.00 6.00
## 5 5 0 2.00 2.00
## 6 6 0 3.00 3.00
## 7 7 0 5.00 5.00
## 8 8 0 2.00 2.00
## 9 9 0 4.00 4.00
## 10 10 1.00 15.0 14.0
## # ... with 943 more rows
# Create effort
effort <- tidy_reviews_without_stopwords %>%
count(id)
# Inner join
pos_neg_with_effort <- pos_neg %>%
inner_join(effort, by=c("id"))
# Review
pos_neg_with_effort
## # A tibble: 953 x 5
## # Groups: id [?]
## id negative positive polarity n
## <int> <dbl> <dbl> <dbl> <int>
## 1 1 0 4.00 4.00 26
## 2 2 0 3.00 3.00 27
## 3 3 0 3.00 3.00 16
## 4 4 0 6.00 6.00 32
## 5 5 0 2.00 2.00 8
## 6 6 0 3.00 3.00 21
## 7 7 0 5.00 5.00 18
## 8 8 0 2.00 2.00 10
## 9 9 0 4.00 4.00 12
## 10 10 1.00 15.0 14.0 46
## # ... with 943 more rows
# Add pol
pos_neg_pol <- pos_neg_with_effort %>%
mutate(
pol = ifelse(
polarity >= 0,
"Positive",
"Negative"
)
)
# Plot
ggplot(pos_neg_pol, aes(polarity, n, color = pol)) +
geom_point(alpha = 0.25) +
geom_smooth(method = "lm", se = FALSE) +
ggthemes::theme_gdocs() +
ggtitle("Relationship between word effort & polarity")
# This exercise will create a common visual for you to understand term frequency
# Specifically, you will review the most frequent terms from among the positive and negative collapsed documents
# Recall the TermDocumentMatrix all_tdm you created earlier
# Instead of 1000 rental reviews the matrix contains 2 documents containing all reviews separated by the polarity() score
# It's usually easier to change the TDM to a matrix
# From there you simply rename the columns
# Remember that the colnames() function is called on the left side of the assignment operator as shown below
# colnames(OBJECT) <- c("COLUMN_NAME1", "COLUMN_NAME2")
# Once done, you will reorder the matrix to see the most positive and negative words. Review these terms so you can answer the conclusion exercises!
# Lastly, you'll visualize the terms using comparison.cloud().
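# A quick reminder of how order() works (toy numbers, not the TDM values):
order(c(0.5, 0.1, 0.3), decreasing = TRUE)
# returns 1 3 2, i.e. the row positions sorted from largest to smallest value,
# which is why it can be used to reorder the matrix rows below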
# Matrix
all_tdm_m <- as.matrix(all_tdm)
# Column names
colnames(all_tdm_m) <- c("positive", "negative")
# Top pos words
order_by_pos <- order(all_tdm_m[, 1], decreasing = TRUE)
# Review top 10 pos words
all_tdm_m[order_by_pos, ] %>% head(n=10)
## Docs
## Terms positive negative
## walk 0.004557696 0
## definitely 0.004172956 0
## staying 0.003729024 0
## city 0.003285093 0
## wonderful 0.003107520 0
## restaurants 0.003048329 0
## highly 0.002959543 0
## station 0.002693184 0
## enjoyed 0.002426825 0
## subway 0.002397230 0
# Top neg words
order_by_neg <- order(all_tdm_m[, 2], decreasing = TRUE)
# Review top 10 neg words
all_tdm_m[order_by_neg, ] %>% head(n=10)
## Docs
## Terms positive negative
## condition 0 0.002159827
## don´t 0 0.002159827
## demand 0 0.001439885
## disappointed 0 0.001439885
## dumpsters 0 0.001439885
## hygiene 0 0.001439885
## inform 0 0.001439885
## it´s 0 0.001439885
## nasty 0 0.001439885
## safety 0 0.001439885
# Get rid of non-alphanumeric (including weird punctuation)
delLines <- grepl(pattern="[^a-zA-Z\\d]", x=rownames(all_tdm_m))
sum(delLines)
## [1] 245
length(delLines)
## [1] 4967
# Comparison cloud
wordcloud::comparison.cloud(
all_tdm_m[!delLines, ],
max.words = 20,
colors = c("darkgreen","darkred")
)
## Warning in wordcloud::comparison.cloud(all_tdm_m[!delLines, ], max.words =
## 20, : sounds could not be fit on page. It will not be plotted.
## Warning in wordcloud::comparison.cloud(all_tdm_m[!delLines, ], max.words =
## 20, : speaking could not be fit on page. It will not be plotted.
## Warning in wordcloud::comparison.cloud(all_tdm_m[!delLines, ], max.words =
## 20, : unsafe could not be fit on page. It will not be plotted.
# Recall the "grade inflation" of polarity scores on the rental reviews?
# Sometimes another way to uncover an insight is to re-center the scores so they average 0 and then perform the corpus subset
# This means some of the previously positive comments may become part of the negative subsection or vice versa since the mean is changed to 0
# This exercise will help you scale the scores and then re-plot the comparison.cloud()
# Removing the "grade inflation" can help provide additional insights
# Previously you applied polarity() to the bos_reviews$comments and created a comparison.cloud()
# In this exercise you will scale() the outcome before creating the comparison.cloud()
# See if this shows something different in the visual!
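# A quick sketch of what scale() does (toy numbers, not the review scores):
scale(c(1, 2, 3))
# returns -1, 0, 1 (as a one-column matrix): each value minus the mean, divided by the standard deviation,
# so reviews that were mildly positive but below the average polarity end up below 0 after scaling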
# Review
bos_pol$all[1:6, 1:3]
## all wc polarity
## 1 all 77 1.1851900
## 2 all 78 1.2455047
## 3 all 39 0.4803845
## 4 all 101 0.7562283
## 5 all 16 0.2500000
## 6 all 79 0.5625440
# Scale/center & append
bos_reviews$scaled_polarity <- scale(bos_pol$all$polarity)
# Subset positive comments
pos_comments <- subset(bos_reviews$comments, bos_reviews$scaled_polarity > 0)
# Subset negative comments
neg_comments <- subset(bos_reviews$comments, bos_reviews$scaled_polarity < 0)
# Paste and collapse the positive comments
pos_terms <- paste(pos_comments, collapse = " ")
# Paste and collapse the negative comments
neg_terms <- paste(neg_comments, collapse = " ")
# Organize
all_terms<- c(pos_terms, neg_terms)
# VCorpus
all_corpus <- VCorpus(VectorSource(all_terms))
# TDM
all_tdm <- TermDocumentMatrix(
all_corpus,
control = list(
weighting = weightTfIdf,
removePunctuation = TRUE,
stopwords = stopwords(kind = "en")
)
)
# Column names
all_tdm_m <- as.matrix(all_tdm)
colnames(all_tdm_m) <- c("positive", "negative")
# Comparison cloud
wordcloud::comparison.cloud(
all_tdm_m,
max.words = 40,
colors = c("darkgreen", "darkred")
)
## Warning in wordcloud::comparison.cloud(all_tdm_m, max.words = 40, colors
## = c("darkgreen", : suggested could not be fit on page. It will not be
## plotted.
## Warning in wordcloud::comparison.cloud(all_tdm_m, max.words = 40, colors
## = c("darkgreen", : amazingly could not be fit on page. It will not be
## plotted.
## Warning in wordcloud::comparison.cloud(all_tdm_m, max.words = 40, colors
## = c("darkgreen", : appliances could not be fit on page. It will not be
## plotted.
## Warning in wordcloud::comparison.cloud(all_tdm_m, max.words = 40, colors
## = c("darkgreen", : luxurious could not be fit on page. It will not be
## plotted.
## Warning in wordcloud::comparison.cloud(all_tdm_m, max.words = 40, colors =
## c("darkgreen", : meetings could not be fit on page. It will not be plotted.
## Warning in wordcloud::comparison.cloud(all_tdm_m, max.words = 40, colors
## = c("darkgreen", : mentioned could not be fit on page. It will not be
## plotted.
## Warning in wordcloud::comparison.cloud(all_tdm_m, max.words = 40, colors
## = c("darkgreen", : particularly could not be fit on page. It will not be
## plotted.
## Warning in wordcloud::comparison.cloud(all_tdm_m, max.words = 40, colors =
## c("darkgreen", : phyllis could not be fit on page. It will not be plotted.
## Warning in wordcloud::comparison.cloud(all_tdm_m, max.words = 40, colors =
## c("darkgreen", : spaces could not be fit on page. It will not be plotted.
## Warning in wordcloud::comparison.cloud(all_tdm_m, max.words = 40, colors =
## c("darkgreen", : unique could not be fit on page. It will not be plotted.
Chapter 1 - Tweets Across the United States
Sentiment analysis and tidy tools:
Sentiment analysis via inner join:
Using dplyr verbs to analyze sentiment analysis results:
Looking at differences by state:
Example code includes:
# Choose the bing lexicon
tidytext::get_sentiments("bing")
## # A tibble: 6,788 x 2
## word sentiment
## <chr> <chr>
## 1 2-faced negative
## 2 2-faces negative
## 3 a+ positive
## 4 abnormal negative
## 5 abolish negative
## 6 abominable negative
## 7 abominably negative
## 8 abominate negative
## 9 abomination negative
## 10 abort negative
## # ... with 6,778 more rows
# Choose the nrc lexicon
tidytext::get_sentiments("nrc") %>%
count(sentiment) # Count words by sentiment
## # A tibble: 10 x 2
## sentiment n
## <chr> <int>
## 1 anger 1247
## 2 anticipation 839
## 3 disgust 1058
## 4 fear 1476
## 5 joy 689
## 6 negative 3324
## 7 positive 2312
## 8 sadness 1191
## 9 surprise 534
## 10 trust 1231
# geocoded_tweets has been pre-defined
load("./RInputFiles/geocoded_tweets.rda")
geocoded_tweets
## # A tibble: 520,304 x 3
## state word freq
## <chr> <chr> <dbl>
## 1 alabama a 16256686
## 2 alabama a- 5491
## 3 alabama a-day 3992
## 4 alabama aa 4739
## 5 alabama aaliyah 8252
## 6 alabama aamu 4306
## 7 alabama aaron 19813
## 8 alabama ab 68032
## 9 alabama abandoned 4071
## 10 alabama abbeville 7153
## # ... with 520,294 more rows
# Access bing lexicon: bing
bing <- tidytext::get_sentiments("bing")
# Use data frame with text data
geocoded_tweets %>%
# With inner join, implement sentiment analysis using `bing`
inner_join(bing, by=c("word"))
## # A tibble: 64,303 x 4
## state word freq sentiment
## <chr> <chr> <dbl> <chr>
## 1 alabama abuse 7186 negative
## 2 alabama abused 3073 negative
## 3 alabama accomplish 5957 positive
## 4 alabama accomplished 13121 positive
## 5 alabama accomplishment 3036 positive
## 6 alabama accurate 28262 positive
## 7 alabama ache 7306 negative
## 8 alabama aching 5080 negative
## 9 alabama addict 5441 negative
## 10 alabama addicted 40389 negative
## # ... with 64,293 more rows
# Create the tweets_nrc data
nrc <- tidytext::get_sentiments("nrc")
tweets_nrc <- geocoded_tweets %>%
inner_join(nrc, by=c("word"))
# tweets_nrc has been pre-defined
tweets_nrc
## # A tibble: 210,027 x 4
## state word freq sentiment
## <chr> <chr> <dbl> <chr>
## 1 alabama abandoned 4071 anger
## 2 alabama abandoned 4071 fear
## 3 alabama abandoned 4071 negative
## 4 alabama abandoned 4071 sadness
## 5 alabama ability 12406 positive
## 6 alabama abortion 3267 disgust
## 7 alabama abortion 3267 fear
## 8 alabama abortion 3267 negative
## 9 alabama abortion 3267 sadness
## 10 alabama absolute 22956 positive
## # ... with 210,017 more rows
tweets_nrc %>%
# Filter to only choose the words associated with sadness
filter(sentiment == "sadness") %>%
# Group by word
group_by(word) %>%
# Use the summarize verb to find the mean frequency
summarize(freq = mean(freq)) %>%
# Arrange to sort in order of descending frequency
arrange(desc(freq))
## # A tibble: 585 x 2
## word freq
## <chr> <dbl>
## 1 hate 1253840
## 2 bad 984943
## 3 bitch 787774
## 4 hell 486259
## 5 crazy 447047
## 6 feeling 407562
## 7 leave 397806
## 8 mad 393559
## 9 music 373608
## 10 sick 362023
## # ... with 575 more rows
# (If you are familiar with geom_bar(stat = "identity"), geom_col() does the same thing.)
# tweets_nrc has been pre-defined
# tweets_nrc
joy_words <- tweets_nrc %>%
# Filter to choose only words associated with joy
filter(sentiment == "joy") %>%
# Group by each word
group_by(word) %>%
# Use the summarize verb to find the mean frequency
summarize(freq = mean(freq)) %>%
# Arrange to sort in order of descending frequency
arrange(desc(freq))
joy_words %>%
top_n(20) %>%
mutate(word = reorder(word, freq)) %>%
# Use aes() to put words on the x-axis and frequency on the y-axis
ggplot(aes(x=word, y=freq)) +
# Make a bar chart with geom_col()
geom_col() +
coord_flip()
## Selecting by freq
# tweets_nrc has been pre-defined
# tweets_nrc
tweets_nrc %>%
# Find only the words for the state of Utah and associated with joy
filter(state == "utah", sentiment == "joy") %>%
# Arrange to sort in order of descending frequency
arrange(desc(freq))
## # A tibble: 326 x 4
## state word freq sentiment
## <chr> <chr> <dbl> <chr>
## 1 utah love 4207322 joy
## 2 utah good 3035114 joy
## 3 utah happy 1402568 joy
## 4 utah pretty 902947 joy
## 5 utah fun 764045 joy
## 6 utah birthday 663439 joy
## 7 utah beautiful 653061 joy
## 8 utah friend 627522 joy
## 9 utah hope 571841 joy
## 10 utah god 536687 joy
## # ... with 316 more rows
tweets_nrc %>%
# Find only the words for the state of Louisiana and associated with joy
filter(state == "louisiana", sentiment == "joy") %>%
# Arrange to sort in order of descending frequency
arrange(desc(freq))
## # A tibble: 290 x 4
## state word freq sentiment
## <chr> <chr> <dbl> <chr>
## 1 louisiana love 3764157 joy
## 2 louisiana good 2758699 joy
## 3 louisiana baby 1184392 joy
## 4 louisiana happy 1176291 joy
## 5 louisiana god 882457 joy
## 6 louisiana birthday 740497 joy
## 7 louisiana money 677899 joy
## 8 louisiana hope 675559 joy
## 9 louisiana pretty 581242 joy
## 10 louisiana feeling 486367 joy
## # ... with 280 more rows
# For the last exercise in this chapter, you will determine how the overall sentiment of tweets varies from state to state
# You will use a dataset called tweets_bing, which is the output of an inner join created in just the same way as earlier
# Check out what tweets_bing looks like in the console
# Create the tweets_bing data
bing <- tidytext::get_sentiments("bing")
tweets_bing <- geocoded_tweets %>%
inner_join(bing, by=c("word"))
# tweets_bing has been pre-defined
tweets_bing
## # A tibble: 64,303 x 4
## state word freq sentiment
## <chr> <chr> <dbl> <chr>
## 1 alabama abuse 7186 negative
## 2 alabama abused 3073 negative
## 3 alabama accomplish 5957 positive
## 4 alabama accomplished 13121 positive
## 5 alabama accomplishment 3036 positive
## 6 alabama accurate 28262 positive
## 7 alabama ache 7306 negative
## 8 alabama aching 5080 negative
## 9 alabama addict 5441 negative
## 10 alabama addicted 40389 negative
## # ... with 64,293 more rows
tweets_bing %>%
# Group by two columns: state and sentiment
group_by(state, sentiment) %>%
# Use summarize to calculate the mean frequency for these groups
summarize(freq = mean(freq)) %>%
tidyr::spread(sentiment, freq) %>%
ungroup() %>%
# Calculate the ratio of positive to negative words
mutate(ratio = positive / negative,
state = reorder(state, ratio)) %>%
# Use aes() to put state on the x-axis and ratio on the y-axis
ggplot(aes(x=state, y=ratio)) +
# Make a plot with points using geom_point()
geom_point() +
coord_flip()
Chapter 2 - Shakespeare Gets Sentimental
Tidying Shakespeare plays:
Using count and mutate:
Sentiment contributions by individual words:
Which words are important in each play?
Example code includes:
load("./RInputFiles/shakespeare.rda")
# The data set shakespeare is available in the workspace
shakespeare
## # A tibble: 25,888 x 3
## title type text
## <chr> <chr> <chr>
## 1 The Tragedy of Romeo and Juliet Tragedy The Complete Works of William ~
## 2 The Tragedy of Romeo and Juliet Tragedy ""
## 3 The Tragedy of Romeo and Juliet Tragedy The Tragedy of Romeo and Juliet
## 4 The Tragedy of Romeo and Juliet Tragedy ""
## 5 The Tragedy of Romeo and Juliet Tragedy The Library of the Future Comp~
## 6 The Tragedy of Romeo and Juliet Tragedy Library of the Future is a Tra~
## 7 The Tragedy of Romeo and Juliet Tragedy ""
## 8 The Tragedy of Romeo and Juliet Tragedy ""
## 9 The Tragedy of Romeo and Juliet Tragedy <<THIS ELECTRONIC VERSION OF T~
## 10 The Tragedy of Romeo and Juliet Tragedy SHAKESPEARE IS COPYRIGHT 1990-~
## # ... with 25,878 more rows
# Pipe the shakespeare data frame to the next line
shakespeare %>%
# Use count to find out how many titles/types there are
count(title, type)
## # A tibble: 6 x 3
## title type n
## <chr> <chr> <int>
## 1 A Midsummer Night's Dream Comedy 3459
## 2 Hamlet, Prince of Denmark Tragedy 6776
## 3 Much Ado about Nothing Comedy 3799
## 4 The Merchant of Venice Comedy 4225
## 5 The Tragedy of Macbeth Tragedy 3188
## 6 The Tragedy of Romeo and Juliet Tragedy 4441
tidy_shakespeare <- shakespeare %>%
# Group by the titles of the plays
group_by(title) %>%
# Define a new column linenumber
mutate(linenumber=row_number()) %>%
# Transform the non-tidy text data to tidy text data
tidytext::unnest_tokens(word, text) %>%
ungroup()
# Pipe the tidy Shakespeare data frame to the next line
tidy_shakespeare %>%
# Use count to find out how many times each word is used
count(word, sort = TRUE)
## # A tibble: 10,736 x 2
## word n
## <chr> <int>
## 1 the 4651
## 2 and 4170
## 3 i 3296
## 4 to 3047
## 5 of 2645
## 6 a 2511
## 7 you 2287
## 8 my 1913
## 9 in 1836
## 10 that 1721
## # ... with 10,726 more rows
shakespeare_sentiment <- tidy_shakespeare %>%
# Implement sentiment analysis with the "bing" lexicon
inner_join(tidytext::get_sentiments("bing"), by=c("word"))
shakespeare_sentiment %>%
# Find how many positive/negative words each play has
count(title, sentiment)
## # A tibble: 12 x 3
## title sentiment n
## <chr> <chr> <int>
## 1 A Midsummer Night's Dream negative 681
## 2 A Midsummer Night's Dream positive 773
## 3 Hamlet, Prince of Denmark negative 1323
## 4 Hamlet, Prince of Denmark positive 1223
## 5 Much Ado about Nothing negative 767
## 6 Much Ado about Nothing positive 1127
## 7 The Merchant of Venice negative 740
## 8 The Merchant of Venice positive 962
## 9 The Tragedy of Macbeth negative 914
## 10 The Tragedy of Macbeth positive 749
## 11 The Tragedy of Romeo and Juliet negative 1235
## 12 The Tragedy of Romeo and Juliet positive 1090
sentiment_counts <- tidy_shakespeare %>%
# Implement sentiment analysis using the "bing" lexicon
inner_join(tidytext::get_sentiments("bing"), by=c("word")) %>%
# Count the number of words by title, type, and sentiment
count(title, type, sentiment)
sentiment_counts %>%
# Group by the titles of the plays
group_by(title) %>%
# Find the total number of words in each play
mutate(total = sum(n),
# Calculate the number of words divided by the total
percent = n / total) %>%
# Filter the results for only negative sentiment
filter(sentiment == "negative") %>%
arrange(percent)
## # A tibble: 6 x 6
## # Groups: title [6]
## title type sentiment n total percent
## <chr> <chr> <chr> <int> <int> <dbl>
## 1 Much Ado about Nothing Comedy negative 767 1894 0.405
## 2 The Merchant of Venice Comedy negative 740 1702 0.435
## 3 A Midsummer Night's Dream Comedy negative 681 1454 0.468
## 4 Hamlet, Prince of Denmark Tragedy negative 1323 2546 0.520
## 5 The Tragedy of Romeo and Juliet Tragedy negative 1235 2325 0.531
## 6 The Tragedy of Macbeth Tragedy negative 914 1663 0.550
# Notice what the line mutate(word = reorder(word, n)) does; it converts word from a character that would be plotted in alphabetical order to a factor that will be plotted in order of n
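# A tiny illustration of reorder() (toy values, not course data):
levels(reorder(c("b", "a", "c"), c(10, 30, 20)))
# returns "b" "c" "a": the factor levels are ordered by the supplied values rather than alphabetically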
word_counts <- tidy_shakespeare %>%
# Implement sentiment analysis using the "bing" lexicon
inner_join(tidytext::get_sentiments("bing"), by=c("word")) %>%
# Count by word and sentiment
count(word, sentiment)
top_words <- word_counts %>%
# Group by sentiment
group_by(sentiment) %>%
# Take the top 10 for each sentiment
top_n(10) %>%
ungroup() %>%
# Make word a factor in order of n
mutate(word = reorder(word, n))
## Selecting by n
# Use aes() to put words on the x-axis and n on the y-axis
ggplot(top_words, aes(x=word, y=n, fill = sentiment)) +
# Make a bar chart with geom_col()
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free") +
coord_flip()
# Correct! The word “wilt” was used differently in Shakespeare's time and was not negative; the lexicon has misidentified it
# For example, from Romeo and Juliet, “For thou wilt lie upon the wings of night”
# It is important to explore the details of how words were scored when performing sentiment analyses
tidy_shakespeare %>%
# Count by title and word
count(title, word, sort = TRUE) %>%
# Implement sentiment analysis using the "afinn" lexicon
inner_join(tidytext::get_sentiments("afinn"), by=c("word")) %>%
# Filter to only examine the scores for Macbeth that are negative
filter(title == "The Tragedy of Macbeth", score < 0)
## # A tibble: 237 x 4
## title word n score
## <chr> <chr> <int> <int>
## 1 The Tragedy of Macbeth no 73 -1
## 2 The Tragedy of Macbeth fear 35 -2
## 3 The Tragedy of Macbeth death 20 -2
## 4 The Tragedy of Macbeth bloody 16 -3
## 5 The Tragedy of Macbeth poor 16 -2
## 6 The Tragedy of Macbeth strange 16 -1
## 7 The Tragedy of Macbeth dead 14 -3
## 8 The Tragedy of Macbeth leave 14 -1
## 9 The Tragedy of Macbeth fight 13 -1
## 10 The Tragedy of Macbeth charges 11 -2
## # ... with 227 more rows
sentiment_contributions <- tidy_shakespeare %>%
# Count by title and word
count(title, word, sort = TRUE) %>%
# Implement sentiment analysis using the "afinn" lexicon
inner_join(tidytext::get_sentiments("afinn"), by=c("word")) %>%
# Group by title
group_by(title) %>%
# Calculate a contribution for each word in each title
mutate(contribution = n * score / sum(n)) %>%
ungroup()
sentiment_contributions
## # A tibble: 2,366 x 5
## title word n score contribution
## <chr> <chr> <int> <int> <dbl>
## 1 Hamlet, Prince of Denmark no 143 -1 -0.0652
## 2 The Tragedy of Romeo and Juliet love 140 3 0.213
## 3 Much Ado about Nothing no 132 -1 -0.0768
## 4 Much Ado about Nothing hero 114 2 0.133
## 5 A Midsummer Night's Dream love 110 3 0.270
## 6 Hamlet, Prince of Denmark good 109 3 0.149
## 7 The Tragedy of Romeo and Juliet no 102 -1 -0.0518
## 8 Much Ado about Nothing good 93 3 0.162
## 9 The Merchant of Venice no 92 -1 -0.0630
## 10 Much Ado about Nothing love 91 3 0.159
## # ... with 2,356 more rows
sentiment_contributions %>%
# Filter for Hamlet
filter(title == "Hamlet, Prince of Denmark") %>%
# Arrange to see the most negative words
arrange(contribution)
## # A tibble: 493 x 5
## title word n score contribution
## <chr> <chr> <int> <int> <dbl>
## 1 Hamlet, Prince of Denmark no 143 -1 -0.0652
## 2 Hamlet, Prince of Denmark dead 33 -3 -0.0451
## 3 Hamlet, Prince of Denmark death 38 -2 -0.0347
## 4 Hamlet, Prince of Denmark madness 22 -3 -0.0301
## 5 Hamlet, Prince of Denmark mad 21 -3 -0.0287
## 6 Hamlet, Prince of Denmark fear 21 -2 -0.0192
## 7 Hamlet, Prince of Denmark poor 20 -2 -0.0182
## 8 Hamlet, Prince of Denmark hell 10 -4 -0.0182
## 9 Hamlet, Prince of Denmark grave 17 -2 -0.0155
## 10 Hamlet, Prince of Denmark ghost 32 -1 -0.0146
## # ... with 483 more rows
sentiment_contributions %>%
# Filter for The Merchant of Venice
filter(title == "The Merchant of Venice") %>%
# Arrange to see the most positive words
arrange(desc(contribution))
## # A tibble: 344 x 5
## title word n score contribution
## <chr> <chr> <int> <int> <dbl>
## 1 The Merchant of Venice good 63 3 0.129
## 2 The Merchant of Venice love 60 3 0.123
## 3 The Merchant of Venice fair 35 2 0.0479
## 4 The Merchant of Venice like 34 2 0.0466
## 5 The Merchant of Venice true 24 2 0.0329
## 6 The Merchant of Venice sweet 23 2 0.0315
## 7 The Merchant of Venice pray 42 1 0.0288
## 8 The Merchant of Venice better 21 2 0.0288
## 9 The Merchant of Venice justice 17 2 0.0233
## 10 The Merchant of Venice welcome 17 2 0.0233
## # ... with 334 more rows
# After these lines of code, you will have the number of positive and negative words used in each indexed section of the play
# These sections will be 70 lines long in your analysis here
# You want a chunk of text that is not too small (because then the sentiment changes will be very noisy) and not too big (because then you will not be able to see plot structure)
# In an analysis of this type you may need to experiment with what size chunks to make; sections of 70 lines works well for these plays
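# A small illustration of the chunking (toy line numbers, not from the exercise): %/% is integer division
c(1, 69, 70, 150) %/% 70
# returns 0 0 1 2, so lines 0-69 fall in index 0, lines 70-139 in index 1, and so on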
tidy_shakespeare %>%
# Implement sentiment analysis using "bing" lexicon
inner_join(tidytext::get_sentiments("bing"), by=c("word")) %>%
# Count using four arguments
count(title, type, index = linenumber %/% 70, sentiment)
## # A tibble: 744 x 5
## title type index sentiment n
## <chr> <chr> <dbl> <chr> <int>
## 1 A Midsummer Night's Dream Comedy 0 negative 4
## 2 A Midsummer Night's Dream Comedy 0 positive 11
## 3 A Midsummer Night's Dream Comedy 1.00 negative 7
## 4 A Midsummer Night's Dream Comedy 1.00 positive 19
## 5 A Midsummer Night's Dream Comedy 2.00 negative 20
## 6 A Midsummer Night's Dream Comedy 2.00 positive 23
## 7 A Midsummer Night's Dream Comedy 3.00 negative 12
## 8 A Midsummer Night's Dream Comedy 3.00 positive 18
## 9 A Midsummer Night's Dream Comedy 4.00 negative 9
## 10 A Midsummer Night's Dream Comedy 4.00 positive 27
## # ... with 734 more rows
tidy_shakespeare %>%
inner_join(tidytext::get_sentiments("bing")) %>%
count(title, type, index = linenumber %/% 70, sentiment) %>%
# Spread sentiment and n across multiple columns
tidyr::spread(sentiment, n, fill = 0) %>%
# Use mutate to find net sentiment
mutate(sentiment = positive - negative)
## Joining, by = "word"
## # A tibble: 373 x 6
## title type index negative positive sentiment
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 A Midsummer Night's Dream Comedy 0 4.00 11.0 7.00
## 2 A Midsummer Night's Dream Comedy 1.00 7.00 19.0 12.0
## 3 A Midsummer Night's Dream Comedy 2.00 20.0 23.0 3.00
## 4 A Midsummer Night's Dream Comedy 3.00 12.0 18.0 6.00
## 5 A Midsummer Night's Dream Comedy 4.00 9.00 27.0 18.0
## 6 A Midsummer Night's Dream Comedy 5.00 11.0 21.0 10.0
## 7 A Midsummer Night's Dream Comedy 6.00 12.0 16.0 4.00
## 8 A Midsummer Night's Dream Comedy 7.00 9.00 6.00 - 3.00
## 9 A Midsummer Night's Dream Comedy 8.00 6.00 12.0 6.00
## 10 A Midsummer Night's Dream Comedy 9.00 19.0 12.0 - 7.00
## # ... with 363 more rows
tidy_shakespeare %>%
inner_join(tidytext::get_sentiments("bing")) %>%
count(title, type, index = linenumber %/% 70, sentiment) %>%
tidyr::spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative) %>%
# Put index on x-axis, sentiment on y-axis, and map comedy/tragedy to fill
ggplot(aes(x=index, y=sentiment, fill=type)) +
# Make a bar chart with geom_col()
geom_col() +
# Separate panels for each title with facet_wrap()
facet_wrap(~ title, scales = "free_x")
## Joining, by = "word"
Chapter 3 - Analyzing TV News
That’s the way it is:
Comparing TV stations:
Sentiment changes with time:
Example code includes:
# Take a look at the dataset of TV news text about climate change you will use in this chapter. The climate_text dataset contains almost 600 closed captioning snippets and four columns
# station, the TV news station where the text is from,
# show, the show on that station where the text was spoken,
# show_date, the broadcast date of the spoken text, and
# text, the actual text spoken on TV
# Type climate_text in the console to take a look at the dataset before getting started with transforming it to a tidy format.
load("./RInputFiles/climate_text.rda")
climate_text
## # A tibble: 593 x 4
## station show show_date text
## <chr> <chr> <dttm> <chr>
## 1 MSNBC Morning Meeting 2009-09-22 13:00:00 the i~
## 2 MSNBC Morning Meeting 2009-10-23 13:00:00 corpo~
## 3 CNN CNN Newsroom 2009-12-03 20:00:00 he sa~
## 4 CNN American Morning 2009-12-07 11:00:00 espec~
## 5 MSNBC Morning Meeting 2009-12-08 14:00:00 lots ~
## 6 MSNBC Countdown With Keith Olbermann 2009-12-10 06:00:00 so th~
## 7 CNN Sanjay Gupta MD 2009-12-12 12:30:00 let m~
## 8 CNN The Situation Room With Wolf Blitzer 2009-12-16 21:00:00 other~
## 9 MSNBC Countdown With Keith Olbermann 2009-12-19 01:00:00 let d~
## 10 MSNBC The Rachel Maddow Show 2010-01-08 04:00:00 you k~
## # ... with 583 more rows
data(stop_words, package="tidytext")
stop_words
## # A tibble: 1,149 x 2
## word lexicon
## <chr> <chr>
## 1 a SMART
## 2 a's SMART
## 3 able SMART
## 4 about SMART
## 5 above SMART
## 6 according SMART
## 7 accordingly SMART
## 8 across SMART
## 9 actually SMART
## 10 after SMART
## # ... with 1,139 more rows
# Pipe the climate_text dataset to the next line
tidy_tv <- climate_text %>%
# Transform the non-tidy text data to tidy text data
tidytext::unnest_tokens(word, text)
tidy_tv %>%
anti_join(stop_words) %>%
# Count by word with sort = TRUE
count(word, sort=TRUE)
## Joining, by = "word"
## # A tibble: 3,699 x 2
## word n
## <chr> <int>
## 1 climate 1627
## 2 change 1615
## 3 people 139
## 4 real 125
## 5 president 112
## 6 global 107
## 7 issue 87
## 8 trump 86
## 9 warming 85
## 10 issues 69
## # ... with 3,689 more rows
tidy_tv %>%
# Count by station
count(station) %>%
# Rename the new column station_total
rename(station_total = n)
## # A tibble: 3 x 2
## station station_total
## <chr> <int>
## 1 CNN 10713
## 2 FOX News 10876
## 3 MSNBC 19487
tv_sentiment <- tidy_tv %>%
# Group by station
group_by(station) %>%
# Define a new column station_total
mutate(station_total = n()) %>%
ungroup() %>%
# Implement sentiment analysis with the NRC lexicon
inner_join(tidytext::get_sentiments("nrc"), by=c("word"))
# Which stations use the most negative words?
tv_sentiment %>%
count(station, sentiment, station_total) %>%
# Define a new column percent
mutate(percent = n / station_total) %>%
# Filter only for negative words
filter(sentiment == "negative") %>%
# Arrange by percent
arrange(percent)
## # A tibble: 3 x 5
## station sentiment station_total n percent
## <chr> <chr> <int> <int> <dbl>
## 1 MSNBC negative 19487 526 0.0270
## 2 CNN negative 10713 331 0.0309
## 3 FOX News negative 10876 403 0.0371
# Now do the same but for positive words
tv_sentiment %>%
count(station, sentiment, station_total) %>%
mutate(percent = n / station_total) %>%
filter(sentiment == "positive") %>%
arrange(percent)
## # A tibble: 3 x 5
## station sentiment station_total n percent
## <chr> <chr> <int> <int> <dbl>
## 1 FOX News positive 10876 514 0.0473
## 2 CNN positive 10713 522 0.0487
## 3 MSNBC positive 19487 953 0.0489
tv_sentiment %>%
# Count by word and sentiment
count(word, sentiment) %>%
# Group by sentiment
group_by(sentiment) %>%
# Take the top 10 words for each sentiment
top_n(10) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
# Set up the plot with aes()
ggplot(aes(x=word, y=n, fill=sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ sentiment, scales = "free") +
coord_flip()
## Selecting by n
# Excellent!
# Notice that you see proper names like Gore and Trump, which should be treated as neutral, and that “change” was a strong driver of fear sentiment, even though it is by definition part of these texts on climate change
# It is important to see which words contribute to your sentiment scores so you can adjust the sentiment lexicons if appropriate
tv_sentiment %>%
# Filter for only negative words
filter(sentiment == "negative") %>%
# Count by word and station
count(word, station) %>%
# Group by station
group_by(station) %>%
# Take the top 10 words for each station
top_n(10) %>%
ungroup() %>%
mutate(word = reorder(paste(word, station, sep = "__"), n)) %>%
# Set up the plot with aes()
ggplot(aes(x=word, y=n, fill=station)) +
geom_col(show.legend = FALSE) +
scale_x_discrete(labels = function(x) gsub("__.+$", "", x)) +
facet_wrap(~ station, nrow = 2, scales = "free") +
coord_flip()
## Selecting by n
sentiment_by_time <- tidy_tv %>%
# Define a new column using floor_date()
mutate(date = lubridate::floor_date(show_date, unit = "6 months")) %>%
# Group by date
group_by(date) %>%
mutate(total_words = n()) %>%
ungroup() %>%
# Implement sentiment analysis using the NRC lexicon
inner_join(tidytext::get_sentiments("nrc"), by=c("word"))
sentiment_by_time %>%
# Filter for positive and negative words
filter(sentiment == "positive" | sentiment == "negative") %>%
# Count by date, sentiment, and total_words
count(date, sentiment, total_words) %>%
ungroup() %>%
mutate(percent = n / total_words) %>%
# Set up the plot with aes()
ggplot(aes(x=date, y=percent, color=sentiment)) +
geom_line(size = 1.5) +
geom_smooth(method = "lm", se = FALSE, lty = 2) +
expand_limits(y = 0)
tidy_tv %>%
# Define a new column that floors each date down to the start of its month
mutate(date = lubridate::floor_date(show_date, unit="1 months")) %>%
filter(word %in% c("threat", "hoax", "denier",
"real", "warming", "hurricane")) %>%
# Count by date and word
count(date, word) %>%
ungroup() %>%
# Set up your plot with aes()
ggplot(aes(x=date, y=n, color=word)) +
# Make facets by word
facet_wrap(~ word) +
geom_line(size = 1.5, show.legend = FALSE) +
expand_limits(y = 0)
Chapter 4 - Singing a Happy Song
Ranking pop songs through the years:
Connecting sentiment to other quantities:
Moving from song rank to year:
Wrap up:
Example code includes:
# Let's take a look at the dataset you will use in this final chapter to practice your sentiment analysis skills
# The song_lyrics dataset contains five columns
# rank, the rank a song achieved on the Billboard Year-End Hot 100,
# song, the song's title,
# artist, the artist who recorded the song,
# year, the year the song reached the given rank on the Billboard chart, and
# lyrics, the lyrics of the song
# This dataset contains over 5000 songs, from 1965 to the present
# The lyrics are all in one column, so they are not yet in a tidy format, ready for analysis using tidy tools
# It's your turn to tidy this text data!
load("./RInputFiles/song_lyrics.rda")
song_lyrics
## # A tibble: 4,831 x 5
## rank song artist year lyrics
## <int> <chr> <chr> <int> <chr>
## 1 1 wooly bully sam th~ 1965 sam the s~
## 2 2 i cant help myself sugar pie honey bunch four t~ 1965 sugar pie~
## 3 4 you were on my mind we five 1965 when i wo~
## 4 5 youve lost that lovin feelin the ri~ 1965 you never~
## 5 6 downtown petula~ 1965 when your~
## 6 7 help the be~ 1965 help i ne~
## 7 8 cant you hear my heart beat herman~ 1965 carterlew~
## 8 9 crying in the chapel elvis ~ 1965 you saw m~
## 9 10 my girl the te~ 1965 ive got s~
## 10 11 help me rhonda the be~ 1965 well sinc~
## # ... with 4,821 more rows
data(stop_words, package="tidytext")
stop_words
## # A tibble: 1,149 x 2
## word lexicon
## <chr> <chr>
## 1 a SMART
## 2 a's SMART
## 3 able SMART
## 4 about SMART
## 5 above SMART
## 6 according SMART
## 7 accordingly SMART
## 8 across SMART
## 9 actually SMART
## 10 after SMART
## # ... with 1,139 more rows
# Pipe song_lyrics to the next line
tidy_lyrics <- song_lyrics %>%
# Transform the lyrics column to a word column
tidytext::unnest_tokens(word, lyrics)
# Print tidy_lyrics
tidy_lyrics
## # A tibble: 1,602,879 x 5
## rank song artist year word
## <int> <chr> <chr> <int> <chr>
## 1 1 wooly bully sam the sham and the pharaohs 1965 sam
## 2 1 wooly bully sam the sham and the pharaohs 1965 the
## 3 1 wooly bully sam the sham and the pharaohs 1965 sham
## 4 1 wooly bully sam the sham and the pharaohs 1965 miscellaneous
## 5 1 wooly bully sam the sham and the pharaohs 1965 wooly
## 6 1 wooly bully sam the sham and the pharaohs 1965 bully
## 7 1 wooly bully sam the sham and the pharaohs 1965 wooly
## 8 1 wooly bully sam the sham and the pharaohs 1965 bully
## 9 1 wooly bully sam the sham and the pharaohs 1965 sam
## 10 1 wooly bully sam the sham and the pharaohs 1965 the
## # ... with 1,602,869 more rows
# For some next steps in this analysis, you need to know the total number of words sung in each song
# Use count() to count up the words per song, and then left_join() these word totals to the tidy data set
# You can specify exactly which column to use when joining the two data frames if you add by = "song"
totals <- tidy_lyrics %>%
# Count by song to find the word totals for each song
count(song) %>%
# Rename the new column
rename(total_words = n)
# Print totals
totals
## # A tibble: 4,341 x 2
## song total_words
## <chr> <int>
## 1 0 to 100 the catch up 894
## 2 1 2 3 4 sumpin new 670
## 3 1 2 3 red light 145
## 4 1 2 step 437
## 5 1 thing 532
## 6 100 pure love 590
## 7 100 years 257
## 8 123 220
## 9 18 and life 285
## 10 19 somethin 281
## # ... with 4,331 more rows
lyric_counts <- tidy_lyrics %>%
# Combine totals with tidy_lyrics using the "song" column
left_join(totals, by = c("song"))
# You have been practicing how to implement sentiment analysis with a join throughout this course
# After transforming the text of these songs to a tidy text dataset and preparing the data frame, the resulting data frame lyric_counts is ready for you to perform sentiment analysis once again
# Once you have done the sentiment analysis, you can learn which songs have the most sentiment words from the NRC lexicon
# Remember that the NRC lexicon has 10 categories of sentiment:
# anger
# anticipation
# disgust
# fear
# joy
# negative
# positive
# sadness
# surprise
# trust
lyric_sentiment <- lyric_counts %>%
# Implement sentiment analysis with the "nrc" lexicon
inner_join(tidytext::get_sentiments("nrc"), by=c("word"))
lyric_sentiment %>%
# Find how many sentiment words each song has
count(song, sentiment, sort = TRUE)
## # A tibble: 39,564 x 3
## song sentiment n
## <chr> <chr> <int>
## 1 baby positive 264
## 2 baby joy 255
## 3 real love positive 213
## 4 angel positive 193
## 5 disturbia negative 182
## 6 live your life positive 174
## 7 my love positive 173
## 8 angel joy 164
## 9 damn negative 164
## 10 disturbia sadness 164
## # ... with 39,554 more rows
# What songs have the highest proportion of negative words?
lyric_sentiment %>%
# Count using three arguments
count(song, sentiment, total_words) %>%
ungroup() %>%
# Make a new percent column with mutate
mutate(percent = n / total_words) %>%
# Filter for only negative words
filter(sentiment == "negative") %>%
# Arrange by descending percent
arrange(desc(percent))
## # A tibble: 4,169 x 5
## song sentiment total_words n percent
## <chr> <chr> <int> <int> <dbl>
## 1 bad boy negative 237 77 0.325
## 2 rack city negative 458 142 0.310
## 3 ill tumble 4 ya negative 269 79 0.294
## 4 time wont let me negative 154 42 0.273
## 5 bang bang my baby shot me down negative 163 40 0.245
## 6 the stroke negative 279 57 0.204
## 7 the wild boys negative 245 49 0.200
## 8 pop that thang negative 204 40 0.196
## 9 disturbia negative 956 182 0.190
## 10 dance a negative 407 72 0.177
## # ... with 4,159 more rows
# What songs have the highest proportion of positive words?
lyric_sentiment %>%
count(song, sentiment, total_words) %>%
ungroup() %>%
mutate(percent = n / total_words) %>%
filter(sentiment == "positive") %>%
arrange(desc(percent))
## # A tibble: 4,295 x 5
## song sentiment total_words n percent
## <chr> <chr> <int> <int> <dbl>
## 1 love to love you baby positive 240 78 0.325
## 2 dance dance dance yowsah yowsah yo~ positive 305 94 0.308
## 3 i got the feelin positive 141 35 0.248
## 4 i love music positive 252 61 0.242
## 5 sweet and innocent positive 218 51 0.234
## 6 me and baby brother positive 181 42 0.232
## 7 love hangover positive 173 40 0.231
## 8 sweet cream ladies positive 179 41 0.229
## 9 mighty love positive 482 110 0.228
## 10 keep feeling fascination positive 189 43 0.228
## # ... with 4,285 more rows
# The lyric_sentiment data frame that you created earlier by using inner_join() is available in your environment
# You can now explore how the sentiment score of a song is related to other aspects of that song
# First, start with Billboard rank, how high on the annual Billboard chart the song reached
# Do songs that use more positive or negative words achieve higher or lower ranks?
# Start with positive words, and make a visualization to see how these characteristics are related
lyric_sentiment %>%
filter(sentiment == "positive") %>%
# Count by song, Billboard rank, and the total number of words
count(song, rank, total_words) %>%
ungroup() %>%
# Use the correct dplyr verb to make two new columns
mutate(percent = n / total_words,
rank = 10 * floor(rank / 10)) %>% # bin ranks into groups of 10 (e.g. rank 37 becomes 30)
ggplot(aes(as.factor(rank), percent)) +
# Make a boxplot
geom_boxplot()
lyric_sentiment %>%
# Filter for only negative words
filter(sentiment == "negative") %>%
# Count by song, Billboard rank, and the total number of words
count(song, rank, total_words) %>%
ungroup() %>%
# Mutate to make a percent column
mutate(percent = n / total_words,
rank = 10 * floor(rank / 10)) %>%
# Use ggplot to set up a plot with rank and percent
ggplot(aes(x=as.factor(rank), y=percent)) +
# Make a boxplot
geom_boxplot()
# How is negative sentiment changing over time?
lyric_sentiment %>%
# Filter for only negative words
filter(sentiment == "negative") %>%
# Count by song, year, and the total number of words
count(song, year, total_words) %>%
ungroup() %>%
mutate(percent = n / total_words,
year = 10 * floor(year / 10)) %>%
# Use ggplot to set up a plot with year and percent
ggplot(aes(x=as.factor(year), y=percent)) +
geom_boxplot()
# How is positive sentiment changing over time?
lyric_sentiment %>%
filter(sentiment == "positive") %>%
count(song, year, total_words) %>%
ungroup() %>%
mutate(percent = n / total_words,
year = 10 * floor(year / 10)) %>%
ggplot(aes(x=as.factor(year), y=percent)) +
geom_boxplot()
negative_by_year <- lyric_sentiment %>%
# Filter for negative words
filter(sentiment == "negative") %>%
count(song, year, total_words) %>%
ungroup() %>%
# Define a new column: percent
mutate(percent = n/total_words)
# Specify the model with percent as the response and year as the predictor
model_negative <- lm(percent ~ year, data = negative_by_year)
# Use summary to see the results of the model fitting
summary(model_negative)
##
## Call:
## lm(formula = percent ~ year, data = negative_by_year)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.030288 -0.017205 -0.005778 0.010505 0.294194
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.809e-02 5.022e-02 0.758 0.448
## year -3.720e-06 2.523e-05 -0.147 0.883
##
## Residual standard error: 0.02513 on 4624 degrees of freedom
## Multiple R-squared: 4.702e-06, Adjusted R-squared: -0.0002116
## F-statistic: 0.02174 on 1 and 4624 DF, p-value: 0.8828
positive_by_year <- lyric_sentiment %>%
filter(sentiment == "positive") %>%
# Count by song, year, and total number of words
count(song, year, total_words) %>%
ungroup() %>%
# Define a new column: percent
mutate(percent = n/total_words)
# Fit a linear model with percent as the response and year as the predictor
model_positive <- lm(percent ~ year, data=positive_by_year)
# Use summary to see the results of the model fitting
summary(model_positive)
##
## Call:
## lm(formula = percent ~ year, data = positive_by_year)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.058050 -0.024032 -0.007756 0.014774 0.269726
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.117e+00 6.859e-02 16.29 <2e-16 ***
## year -5.373e-04 3.446e-05 -15.59 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03495 on 4770 degrees of freedom
## Multiple R-squared: 0.0485, Adjusted R-squared: 0.0483
## F-statistic: 243.1 on 1 and 4770 DF, p-value: < 2.2e-16
Chapter 1 - What is Regression?
Introduction - for this course, regression will be about getting a numerical (rather than categorical) prediction:
Linear regression - fundamental method:
Predicting once you fit a model:
Wrap up for simple linear regression:
Example code includes:
unemployment <- readRDS("./RInputFiles/unemployment.rds")
bloodpressure <- readRDS("./RInputFiles/bloodpressure.rds")
# The data frame unemployment is in your workspace
# unemployment is loaded in the workspace
summary(unemployment)
## male_unemployment female_unemployment
## Min. :2.900 Min. :4.000
## 1st Qu.:4.900 1st Qu.:4.400
## Median :6.000 Median :5.200
## Mean :5.954 Mean :5.569
## 3rd Qu.:6.700 3rd Qu.:6.100
## Max. :9.800 Max. :7.900
# Define a formula to express female_unemployment as a function of male_unemployment
fmla <- female_unemployment ~ male_unemployment
# Print it
fmla
## female_unemployment ~ male_unemployment
# Use the formula to fit a model: unemployment_model
unemployment_model <- lm(fmla, data=unemployment)
# Print it
unemployment_model
##
## Call:
## lm(formula = fmla, data = unemployment)
##
## Coefficients:
## (Intercept) male_unemployment
## 1.4341 0.6945
# There are a variety of different ways to examine a model; each way provides different information
# We will use summary(), broom::glance(), and sigr::wrapFTest()
# broom and sigr are already loaded in your workspace
# Print unemployment_model
unemployment_model
##
## Call:
## lm(formula = fmla, data = unemployment)
##
## Coefficients:
## (Intercept) male_unemployment
## 1.4341 0.6945
# Call summary() on unemployment_model to get more details
summary(unemployment_model)
##
## Call:
## lm(formula = fmla, data = unemployment)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.77621 -0.34050 -0.09004 0.27911 1.31254
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.43411 0.60340 2.377 0.0367 *
## male_unemployment 0.69453 0.09767 7.111 1.97e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5803 on 11 degrees of freedom
## Multiple R-squared: 0.8213, Adjusted R-squared: 0.8051
## F-statistic: 50.56 on 1 and 11 DF, p-value: 1.966e-05
# Call glance() on unemployment_model to see the details in a tidier form
broom::glance(unemployment_model)
## r.squared adj.r.squared sigma statistic p.value df logLik
## 1 0.8213157 0.8050716 0.5802596 50.56108 1.965985e-05 2 -10.28471
## AIC BIC deviance df.residual
## 1 26.56943 28.26428 3.703714 11
# Call wrapFTest() on unemployment_model to see the most relevant details
sigr::wrapFTest(unemployment_model)
## [1] "F Test summary: (R2=0.821, F(1,11)=50.6, p=2e-05)."
# The objects unemployment, unemployment_model and newrates are in your workspace
newrates <- data.frame(male_unemployment=5)
# newrates is in your workspace
newrates
## male_unemployment
## 1 5
# Predict female unemployment in the unemployment data set
unemployment$prediction <- predict(unemployment_model)
# Make a plot to compare predictions to actual (prediction on x axis).
ggplot(unemployment, aes(x = prediction, y = female_unemployment)) +
geom_point() +
geom_abline(color = "blue")
# Predict female unemployment rate when male unemployment is 5%
pred <- predict(unemployment_model, newdata=newrates)
# Print it
pred
## 1
## 4.906757
# In this exercise, you will work with the blood pressure dataset, and model blood_pressure as a function of weight and age.
# The data frame bloodpressure is in the workspace
# bloodpressure is in the workspace
summary(bloodpressure)
## blood_pressure age weight
## Min. :128.0 Min. :46.00 Min. :167
## 1st Qu.:140.0 1st Qu.:56.50 1st Qu.:186
## Median :153.0 Median :64.00 Median :194
## Mean :150.1 Mean :62.45 Mean :195
## 3rd Qu.:160.5 3rd Qu.:69.50 3rd Qu.:209
## Max. :168.0 Max. :74.00 Max. :220
# Create the formula and print it
fmla <- blood_pressure ~ age + weight
fmla
## blood_pressure ~ age + weight
# Fit the model: bloodpressure_model
bloodpressure_model <- lm(fmla, data=bloodpressure)
# Print bloodpressure_model and call summary()
bloodpressure_model
##
## Call:
## lm(formula = fmla, data = bloodpressure)
##
## Coefficients:
## (Intercept) age weight
## 30.9941 0.8614 0.3349
summary(bloodpressure_model)
##
## Call:
## lm(formula = fmla, data = bloodpressure)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4640 -1.1949 -0.4078 1.8511 2.6981
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.9941 11.9438 2.595 0.03186 *
## age 0.8614 0.2482 3.470 0.00844 **
## weight 0.3349 0.1307 2.563 0.03351 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.318 on 8 degrees of freedom
## Multiple R-squared: 0.9768, Adjusted R-squared: 0.9711
## F-statistic: 168.8 on 2 and 8 DF, p-value: 2.874e-07
# Predict blood pressure using bloodpressure_model: prediction
bloodpressure$prediction <- predict(bloodpressure_model)
# plot the results
ggplot(bloodpressure, aes(x=prediction, y=blood_pressure)) +
geom_point() +
geom_abline(color = "blue")
Chapter 2 - Training and Evaluating Regression Models
Evaluating models graphically:
Root Mean Squared Error (RMSE):
R-Squared - value between 0 and 1 for quality of model fit:
Properly Training a Model:
Example code includes:
# The data frame unemployment and model unemployment_model are available in the workspace
# unemployment is in the workspace
summary(unemployment)
## male_unemployment female_unemployment prediction
## Min. :2.900 Min. :4.000 Min. :3.448
## 1st Qu.:4.900 1st Qu.:4.400 1st Qu.:4.837
## Median :6.000 Median :5.200 Median :5.601
## Mean :5.954 Mean :5.569 Mean :5.569
## 3rd Qu.:6.700 3rd Qu.:6.100 3rd Qu.:6.087
## Max. :9.800 Max. :7.900 Max. :8.240
# unemployment_model is in the workspace
summary(unemployment_model)
##
## Call:
## lm(formula = fmla, data = unemployment)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.77621 -0.34050 -0.09004 0.27911 1.31254
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.43411 0.60340 2.377 0.0367 *
## male_unemployment 0.69453 0.09767 7.111 1.97e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5803 on 11 degrees of freedom
## Multiple R-squared: 0.8213, Adjusted R-squared: 0.8051
## F-statistic: 50.56 on 1 and 11 DF, p-value: 1.966e-05
# Make predictions from the model
unemployment$predictions <- predict(unemployment_model)
# Fill in the blanks to plot predictions (on x-axis) versus the female_unemployment rates
ggplot(unemployment, aes(x = predictions, y = female_unemployment)) +
geom_point() +
geom_abline()
# Calculate residuals
unemployment$residuals <- residuals(unemployment_model)
# Fill in the blanks to plot predictions (on x-axis) versus the residuals
ggplot(unemployment, aes(x = predictions, y = residuals)) +
geom_pointrange(aes(ymin = 0, ymax = residuals)) +
geom_hline(yintercept = 0, linetype = 3) +
ggtitle("residuals vs. linear model prediction")
# In the previous exercise you made predictions about female_unemployment and visualized the predictions and the residuals
# Now, you will also plot the gain curve of the unemployment_model's predictions against actual female_unemployment using the WVPlots::GainCurvePlot() function
# For situations where order is more important than exact values, the gain curve helps you check if the model's predictions sort in the same order as the true outcome.
# Calls to the function GainCurvePlot() look like:
# GainCurvePlot(frame, xvar, truthvar, title)
# frame is a data frame
# xvar and truthvar are strings naming the prediction and actual outcome columns of frame
# title is the title of the plot
# When the predictions sort in exactly the same order, the relative Gini coefficient is 1
# When the model sorts poorly, the relative Gini coefficient is close to zero, or even negative
# The data frame unemployment and the model unemployment_model are in the workspace
# Load the package WVPlots
# library(WVPlots)
# Plot the Gain Curve
WVPlots::GainCurvePlot(unemployment, "predictions", "female_unemployment", "Unemployment model")
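# As a rough cross-check (a sketch for intuition, not the exact WVPlots computation), the
# relative Gini coefficient can be approximated by hand: compare the area between the
# model-sorted gain curve and the diagonal with the same area for the ideal ("wizard") sort
gini_area <- function(sortcol, truthcol) {
ord <- order(sortcol, decreasing = TRUE)        # sort items best-first by sortcol
ycum <- cumsum(truthcol[ord]) / sum(truthcol)   # cumulative fraction of outcome captured
xfrac <- seq_along(ycum) / length(ycum)         # fraction of items seen so far
mean(ycum - xfrac)                              # approximate area above the diagonal
}
gini_model  <- gini_area(unemployment$predictions, unemployment$female_unemployment)
gini_wizard <- gini_area(unemployment$female_unemployment, unemployment$female_unemployment)
gini_model / gini_wizard  # close to 1 when the model sorts nearly as well as the true outcome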
# For convenience put the residuals in the variable res
res <- unemployment$residuals
# Calculate RMSE, assign it to the variable rmse and print it
(rmse <- sqrt(mean(res**2)))
## [1] 0.5337612
# Calculate the standard deviation of female_unemployment and print it
(sd_unemployment <- sd(unemployment$female_unemployment))
## [1] 1.314271
# Calculate mean female_unemployment: fe_mean. Print it
(fe_mean <- mean(unemployment$female_unemployment))
## [1] 5.569231
# Calculate total sum of squares: tss. Print it
(tss <- sum((unemployment$female_unemployment - fe_mean)^2))
## [1] 20.72769
# Calculate residual sum of squares: rss. Print it
(rss <- sum(unemployment$residuals ** 2))
## [1] 3.703714
# Calculate R-squared: rsq. Print it. Is it a good fit?
(rsq <- 1 - rss/tss)
## [1] 0.8213157
# Get R-squared from glance. Print it
(rsq_glance <- broom::glance(unemployment_model)$r.squared)
## [1] 0.8213157
# Get the correlation between the prediction and true outcome: rho and print it
(rho <- cor(unemployment$female_unemployment, unemployment$predictions))
## [1] 0.9062647
# Square rho: rho2 and print it
(rho2 <- rho ** 2)
## [1] 0.8213157
# Get R-squared from glance and print it
(rsq_glance <- broom::glance(unemployment_model)$r.squared)
## [1] 0.8213157
# For the next several exercises you will use the mpg data from the package ggplot2
# The data describes the characteristics of several makes and models of cars from different years
# The goal is to predict city fuel efficiency from highway fuel efficiency
# In this exercise, you will split mpg into a training set mpg_train (75% of the data) and a test set mpg_test (25% of the data)
# One way to do this is to generate a column of uniform random numbers between 0 and 1, using the function runif()
# If you have a data set dframe of size N, and you want a random subset of approximately 100*X% of N (where X is between 0 and 1), then:
# Generate a vector of uniform random numbers: gp = runif(N).
# dframe[gp < X,] will be about the right size.
# dframe[gp >= X,] will be the complement.
mpg <- readRDS("./RInputFiles/mpg.rds")
summary(mpg)
## manufacturer model displ year
## Length:234 Length:234 Min. :1.600 Min. :1999
## Class :character Class :character 1st Qu.:2.400 1st Qu.:1999
## Mode :character Mode :character Median :3.300 Median :2004
## Mean :3.472 Mean :2004
## 3rd Qu.:4.600 3rd Qu.:2008
## Max. :7.000 Max. :2008
## cyl trans drv cty
## Min. :4.000 Length:234 Length:234 Min. : 9.00
## 1st Qu.:4.000 Class :character Class :character 1st Qu.:14.00
## Median :6.000 Mode :character Mode :character Median :17.00
## Mean :5.889 Mean :16.86
## 3rd Qu.:8.000 3rd Qu.:19.00
## Max. :8.000 Max. :35.00
## hwy fl class
## Min. :12.00 Length:234 Length:234
## 1st Qu.:18.00 Class :character Class :character
## Median :24.00 Mode :character Mode :character
## Mean :23.44
## 3rd Qu.:27.00
## Max. :44.00
dim(mpg)
## [1] 234 11
# Use nrow to get the number of rows in mpg (N) and print it
(N <- nrow(mpg))
## [1] 234
# Calculate how many rows 75% of N should be and print it
# Hint: use round() to get an integer
(target <- round(0.75 * N))
## [1] 176
# Create the vector of N uniform random variables: gp
gp <- runif(N)
# Use gp to create the training set: mpg_train (75% of data) and mpg_test (25% of data)
mpg_train <- mpg[gp <= 0.75, ]
mpg_test <- mpg[gp > 0.75, ]
# Use nrow() to examine mpg_train and mpg_test
nrow(mpg_train)
## [1] 190
nrow(mpg_test)
## [1] 44
# mpg_train is in the workspace
summary(mpg_train)
## manufacturer model displ year
## Length:190 Length:190 Min. :1.600 Min. :1999
## Class :character Class :character 1st Qu.:2.400 1st Qu.:1999
## Mode :character Mode :character Median :3.300 Median :1999
## Mean :3.469 Mean :2003
## 3rd Qu.:4.600 3rd Qu.:2008
## Max. :6.500 Max. :2008
## cyl trans drv cty
## Min. :4.000 Length:190 Length:190 Min. : 9.00
## 1st Qu.:4.000 Class :character Class :character 1st Qu.:14.00
## Median :6.000 Mode :character Mode :character Median :17.00
## Mean :5.916 Mean :17.01
## 3rd Qu.:8.000 3rd Qu.:19.00
## Max. :8.000 Max. :35.00
## hwy fl class
## Min. :12.00 Length:190 Length:190
## 1st Qu.:18.00 Class :character Class :character
## Median :25.00 Mode :character Mode :character
## Mean :23.68
## 3rd Qu.:27.00
## Max. :44.00
# Create a formula to express cty as a function of hwy: fmla and print it.
(fmla <- cty ~ hwy)
## cty ~ hwy
# Now use lm() to build a model mpg_model from mpg_train that predicts cty from hwy
mpg_model <- lm(fmla, data=mpg_train)
# Use summary() to examine the model
summary(mpg_model)
##
## Call:
## lm(formula = fmla, data = mpg_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.9064 -0.7184 -0.0030 0.6241 4.0809
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.78448 0.35983 2.18 0.0305 *
## hwy 0.68488 0.01473 46.48 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.209 on 188 degrees of freedom
## Multiple R-squared: 0.92, Adjusted R-squared: 0.9195
## F-statistic: 2161 on 1 and 188 DF, p-value: < 2.2e-16
# RMSE: root mean squared error of the predictions against the actual values
rmse <- function(predcol, ycol) {
res = predcol - ycol           # residuals
sqrt(mean(res^2))
}
# R-squared: 1 - (residual sum of squares / total sum of squares)
r_squared <- function(predcol, ycol) {
tss = sum( (ycol - mean(ycol))^2 )   # total sum of squares
rss = sum( (predcol - ycol)^2 )      # residual sum of squares
1 - rss/tss
}
# predict cty from hwy for the training set
mpg_train$pred <- predict(mpg_model)
# predict cty from hwy for the test set
mpg_test$pred <- predict(mpg_model, newdata=mpg_test)
# Evaluate the rmse on both training and test data and print them
(rmse_train <- rmse(mpg_train$pred, mpg_train$cty))
## [1] 1.202648
(rmse_test <- rmse(mpg_test$pred, mpg_test$cty))
## [1] 1.423936
# Evaluate the r-squared on both training and test data and print them
(rsq_train <- r_squared(mpg_train$pred, mpg_train$cty))
## [1] 0.9199508
(rsq_test <- r_squared(mpg_test$pred, mpg_test$cty))
## [1] 0.8834909
# Plot the predictions (on the x-axis) against the outcome (cty) on the test data
ggplot(mpg_test, aes(x = pred, y = cty)) +
geom_point() +
geom_abline()
# There are several ways to implement an n-fold cross validation plan
# In this exercise you will create such a plan using vtreat::kWayCrossValidation(), and examine it
# kWayCrossValidation() creates a cross validation plan with the following call:
# splitPlan <- kWayCrossValidation(nRows, nSplits, dframe, y)
# where nRows is the number of rows of data to be split, and nSplits is the desired number of cross-validation folds.
# Strictly speaking, dframe and y aren't used by kWayCrossValidation; they are there for compatibility with other vtreat data partitioning functions
# You can set them both to NULL
# The resulting splitPlan is a list of nSplits elements; each element contains two vectors:
# train: the indices of dframe that will form the training set
# app: the indices of dframe that will form the test (or application) set
# Get the number of rows in mpg
nRows <- nrow(mpg)
# Implement the 3-fold cross-fold plan with vtreat
splitPlan <- vtreat::kWayCrossValidation(nRows, 3, NULL, NULL)
# Examine the split plan
str(splitPlan)
## List of 3
## $ :List of 2
## ..$ train: int [1:156] 2 4 5 7 8 9 10 11 12 13 ...
## ..$ app : int [1:78] 165 116 205 26 212 158 192 155 135 136 ...
## $ :List of 2
## ..$ train: int [1:156] 1 2 3 4 5 6 7 14 15 16 ...
## ..$ app : int [1:78] 124 231 196 54 107 222 173 87 13 9 ...
## $ :List of 2
## ..$ train: int [1:156] 1 3 6 8 9 10 11 12 13 15 ...
## ..$ app : int [1:78] 169 76 110 50 168 109 179 214 119 120 ...
## - attr(*, "splitmethod")= chr "kwaycross"
# The data frame mpg, the cross validation plan splitPlan, and the function to calculate RMSE (rmse()) from one of the previous exercises are available in your workspace.
# Run the 3-fold cross validation plan from splitPlan
k <- 3 # Number of folds
mpg$pred.cv <- 0
for(i in 1:k) {
split <- splitPlan[[i]]                                               # the i-th train/application split
model <- lm(cty ~ hwy, data = mpg[split$train, ])                     # fit on the training rows
mpg$pred.cv[split$app] <- predict(model, newdata = mpg[split$app, ])  # predict on the held-out rows
}
# Predict from a full model
mpg$pred <- predict(lm(cty ~ hwy, data = mpg))
# Get the rmse of the full model's predictions
rmse(mpg$pred, mpg$cty)
## [1] 1.247045
# Get the rmse of the cross-validation predictions
rmse(mpg$pred.cv, mpg$cty)
## [1] 1.264219
Chapter 3 - Issues to Consider
Categorical Inputs:
Interactions:
Transforming Response before Modeling:
Transforming Inputs Before Modeling:
Example code includes:
# For this exercise you will call model.matrix() to examine how R represents data with both categorical and numerical inputs for modeling
# The dataset flowers (derived from the Sleuth3 package) is loaded into your workspace
# It has the following columns:
# Flowers: the average number of flowers on a meadowfoam plant
# Intensity: the intensity of a light treatment applied to the plant
# Time: A categorical variable - when (Late or Early) in the lifecycle the light treatment occurred
# The ultimate goal is to predict Flowers as a function of Time and Intensity
# The data frame flowers is in your workspace.
flowers <- data.frame(Flowers=c(62.3, 77.4, 55.3, 54.2, 49.6, 61.9, 39.4, 45.7, 31.3, 44.9, 36.8, 41.9, 77.8, 75.6, 69.1, 78, 57, 71.1, 62.9, 52.2, 60.3, 45.6, 52.6, 44.4),
Intensity=c(150, 150, 300, 300, 450, 450, 600, 600, 750, 750, 900, 900, 150, 150, 300, 300, 450, 450, 600, 600, 750, 750, 900, 900),
Time=c('Late', 'Late', 'Late', 'Late', 'Late', 'Late', 'Late', 'Late', 'Late', 'Late', 'Late', 'Late', 'Early', 'Early', 'Early', 'Early', 'Early', 'Early', 'Early', 'Early', 'Early', 'Early', 'Early', 'Early'),
stringsAsFactors=FALSE
)
# Call str on flowers to see the types of each column
str(flowers)
## 'data.frame': 24 obs. of 3 variables:
## $ Flowers : num 62.3 77.4 55.3 54.2 49.6 61.9 39.4 45.7 31.3 44.9 ...
## $ Intensity: num 150 150 300 300 450 450 600 600 750 750 ...
## $ Time : chr "Late" "Late" "Late" "Late" ...
# Use unique() to see how many possible values Time takes
unique(flowers$Time)
## [1] "Late" "Early"
# Build a formula to express Flowers as a function of Intensity and Time: fmla. Print it
(fmla <- as.formula("Flowers ~ Intensity + Time"))
## Flowers ~ Intensity + Time
# Use fmla and model.matrix to see how the data is represented for modeling
mmat <- model.matrix(fmla, data=flowers)
# Examine the first 20 lines of flowers
head(flowers, n=20)
## Flowers Intensity Time
## 1 62.3 150 Late
## 2 77.4 150 Late
## 3 55.3 300 Late
## 4 54.2 300 Late
## 5 49.6 450 Late
## 6 61.9 450 Late
## 7 39.4 600 Late
## 8 45.7 600 Late
## 9 31.3 750 Late
## 10 44.9 750 Late
## 11 36.8 900 Late
## 12 41.9 900 Late
## 13 77.8 150 Early
## 14 75.6 150 Early
## 15 69.1 300 Early
## 16 78.0 300 Early
## 17 57.0 450 Early
## 18 71.1 450 Early
## 19 62.9 600 Early
## 20 52.2 600 Early
# Examine the first 20 lines of mmat
head(mmat, n=20)
## (Intercept) Intensity TimeLate
## 1 1 150 1
## 2 1 150 1
## 3 1 300 1
## 4 1 300 1
## 5 1 450 1
## 6 1 450 1
## 7 1 600 1
## 8 1 600 1
## 9 1 750 1
## 10 1 750 1
## 11 1 900 1
## 12 1 900 1
## 13 1 150 0
## 14 1 150 0
## 15 1 300 0
## 16 1 300 0
## 17 1 450 0
## 18 1 450 0
## 19 1 600 0
## 20 1 600 0
# Fit a model to predict Flowers from Intensity and Time : flower_model
flower_model <- lm(fmla, data=flowers)
# Use summary on mmat to remind yourself of its structure
summary(mmat)
## (Intercept) Intensity TimeLate
## Min. :1 Min. :150 Min. :0.0
## 1st Qu.:1 1st Qu.:300 1st Qu.:0.0
## Median :1 Median :525 Median :0.5
## Mean :1 Mean :525 Mean :0.5
## 3rd Qu.:1 3rd Qu.:750 3rd Qu.:1.0
## Max. :1 Max. :900 Max. :1.0
# Use summary to examine flower_model
summary(flower_model)
##
## Call:
## lm(formula = fmla, data = flowers)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.652 -4.139 -1.558 5.632 12.165
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 83.464167 3.273772 25.495 < 2e-16 ***
## Intensity -0.040471 0.005132 -7.886 1.04e-07 ***
## TimeLate -12.158333 2.629557 -4.624 0.000146 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.441 on 21 degrees of freedom
## Multiple R-squared: 0.7992, Adjusted R-squared: 0.78
## F-statistic: 41.78 on 2 and 21 DF, p-value: 4.786e-08
# Predict the number of flowers on each plant
flowers$predictions <- predict(flower_model)
# Plot predictions vs actual flowers (predictions on x-axis)
ggplot(flowers, aes(x = predictions, y = Flowers)) +
geom_point() +
geom_abline(color = "blue")
# The data frame alcohol is in your workspace.
# alcohol is in the workspace
alcohol <- data.frame(Subject=1:32,
Metabol=c(0.6, 0.6, 1.5, 0.4, 0.1, 0.2, 0.3, 0.3, 0.4, 1, 1.1, 1.2, 1.3, 1.6, 1.8, 2, 2.5, 2.9, 1.5, 1.9, 2.7, 3, 3.7, 0.3, 2.5, 2.7, 3, 4, 4.5, 6.1, 9.5, 12.3),
Gastric=c(1, 1.6, 1.5, 2.2, 1.1, 1.2, 0.9, 0.8, 1.5, 0.9, 1.6, 1.7, 1.7, 2.2, 0.8, 2, 3, 2.2, 1.3, 1.2, 1.4, 1.3, 2.7, 1.1, 2.3, 2.7, 1.4, 2.2, 2, 2.8, 5.2, 4.1),
Sex=c('Female', 'Female', 'Female', 'Female', 'Female', 'Female', 'Female', 'Female', 'Female', 'Female', 'Female', 'Female', 'Female', 'Female', 'Female', 'Female', 'Female', 'Female', 'Male', 'Male', 'Male', 'Male', 'Male', 'Male', 'Male', 'Male', 'Male', 'Male', 'Male', 'Male', 'Male', 'Male'),
Alcohol=c('Alcoholic', 'Alcoholic', 'Alcoholic', 'Non-alcoholic', 'Non-alcoholic', 'Non-alcoholic', 'Non-alcoholic', 'Non-alcoholic', 'Non-alcoholic', 'Non-alcoholic', 'Non-alcoholic', 'Non-alcoholic', 'Non-alcoholic', 'Non-alcoholic', 'Non-alcoholic', 'Non-alcoholic', 'Non-alcoholic', 'Non-alcoholic', 'Alcoholic', 'Alcoholic', 'Alcoholic', 'Alcoholic', 'Alcoholic', 'Non-alcoholic', 'Non-alcoholic', 'Non-alcoholic', 'Non-alcoholic', 'Non-alcoholic', 'Non-alcoholic', 'Non-alcoholic', 'Non-alcoholic', 'Non-alcoholic'),
stringsAsFactors = TRUE
)
summary(alcohol)
## Subject Metabol Gastric Sex
## Min. : 1.00 Min. : 0.100 Min. :0.800 Female:18
## 1st Qu.: 8.75 1st Qu.: 0.600 1st Qu.:1.200 Male :14
## Median :16.50 Median : 1.700 Median :1.600
## Mean :16.50 Mean : 2.422 Mean :1.863
## 3rd Qu.:24.25 3rd Qu.: 2.925 3rd Qu.:2.200
## Max. :32.00 Max. :12.300 Max. :5.200
## Alcohol
## Alcoholic : 8
## Non-alcoholic:24
##
##
##
##
# Create the formula with main effects only
(fmla_add <- Metabol ~ Gastric + Sex)
## Metabol ~ Gastric + Sex
# Create the formula with interactions
(fmla_interaction <- Metabol ~ Gastric + Gastric:Sex )
## Metabol ~ Gastric + Gastric:Sex
# Fit the main effects only model
model_add <- lm(fmla_add, data=alcohol)
# Fit the interaction model
model_interaction <- lm(fmla_interaction, data=alcohol)
# Call summary on both models and compare
summary(model_add)
##
## Call:
## lm(formula = fmla_add, data = alcohol)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.2779 -0.6328 -0.0966 0.5783 4.5703
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.9466 0.5198 -3.745 0.000796 ***
## Gastric 1.9656 0.2674 7.352 4.24e-08 ***
## SexMale 1.6174 0.5114 3.163 0.003649 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.331 on 29 degrees of freedom
## Multiple R-squared: 0.7654, Adjusted R-squared: 0.7492
## F-statistic: 47.31 on 2 and 29 DF, p-value: 7.41e-10
summary(model_interaction)
##
## Call:
## lm(formula = fmla_interaction, data = alcohol)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.4656 -0.5091 0.0143 0.5660 4.0668
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.7504 0.5310 -1.413 0.168236
## Gastric 1.1489 0.3450 3.331 0.002372 **
## Gastric:SexMale 1.0422 0.2412 4.321 0.000166 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.204 on 29 degrees of freedom
## Multiple R-squared: 0.8081, Adjusted R-squared: 0.7948
## F-statistic: 61.05 on 2 and 29 DF, p-value: 4.033e-11
# Create the splitting plan for 3-fold cross validation
set.seed(34245) # set the seed for reproducibility
splitPlan <- vtreat::kWayCrossValidation(nrow(alcohol), 3, NULL, NULL)
# Sample code: Get cross-val predictions for main-effects only model
alcohol$pred_add <- 0 # initialize the prediction vector
for(i in 1:3) {
split <- splitPlan[[i]]
model_add <- lm(fmla_add, data = alcohol[split$train, ])
alcohol$pred_add[split$app] <- predict(model_add, newdata = alcohol[split$app, ])
}
# Get the cross-val predictions for the model with interactions
alcohol$pred_interaction <- 0 # initialize the prediction vector
for(i in 1:3) {
split <- splitPlan[[i]]
model_interaction <- lm(fmla_interaction, data = alcohol[split$train, ])
alcohol$pred_interaction[split$app] <- predict(model_interaction, newdata = alcohol[split$app, ])
}
# Get RMSE
alcohol %>%
tidyr::gather(key = modeltype, value = pred, pred_add, pred_interaction) %>%
mutate(residuals = Metabol-pred) %>%
group_by(modeltype) %>%
summarize(rmse = sqrt(mean(residuals^2)))
## # A tibble: 2 x 2
## modeltype rmse
## <chr> <dbl>
## 1 pred_add 1.64
## 2 pred_interaction 1.38
# The example (toy) dataset fdata is loaded in your workspace. It includes the columns:
# y: the true output to be predicted by some model; imagine it is the amount of money a customer will spend on a visit to your store.
# pred: the predictions of a model that predicts y.
# label: categorical: whether y comes from a population that makes small purchases, or large ones.
fdata <- data.frame(y=c(9.15, 1.9, -3.86, 2.39, 1.54, 13.56, 10.2, 1.1, 3.94, 9.04, 1.73, 15.72, 2.26, -1.98, 1.1, 18.63, 3.68, 3.09, 8.69, 7.91, 5.44, 14.79, 9.02, 3.98, 2.67, 7.68, 11.93, 5.31, 13.06, 2.23, 15.4, -0.88, 7.61, 9.86, 4.36, 3.84, 11.34, 17.13, 16.17, -5.89, 12.64, 6.45, 2.97, 4.08, 5.52, 4.83, 6.72, 1.84, 3.2, 10.82, 1026.4, 202.39, 833.35, 1075.41, 96.12, 438.24, 911.33, 542.56, 686.33, 494.47, 422.81, 1033.88, 161.99, 491.43, 575.92, 384.77, 720.03, 963.94, 159.79, 765.4, 246.42, 1097.92, 1050.4, 1069.62, 116.81, 523.52, 457.74, 1060.56, 761.92, 969.32, 522.68, 475.87, 368.54, 1101.62, 1052.92, 663.04, 136.69, 331.81, 921.96, 773.58, 458.33, 643.47, 738.65, 846.42, 413.66, 180.99, 695.62, 164.9, 106.51, 358.36),
pred=c(6.43, 3.47, 1.59, 3.76, 9.51, 6.93, 8.19, 1.51, 8.99, 6.15, 8.5, 10.94, 6, 1.07, 4.42, 10.52, 5.75, 7.09, 7.84, 4.31, 6, 8.31, 8.63, 2.98, 4.04, 7.46, 9.08, 6.52, 10.34, 4.24, 8.83, 1.17, 4.74, 8.43, 10.19, 4.33, 6.39, 11, 7.52, 2.45, 9.51, 5.21, 3.97, 6.32, 10.17, 3.85, 6.3, 3.65, 2.52, 8.02, 1027.19, 194.52, 826.25, 1081.44, 100.39, 430.29, 912.57, 533.95, 691.79, 498.01, 423.2, 1032.66, 168.55, 492.24, 589.71, 377.15, 730.11, 967.65, 159.21, 767.19, 250.13, 1098.9, 1048.87, 1057.99, 119.28, 524.56, 459.77, 1053.65, 751.11, 966.38, 520.61, 467.72, 364.21, 1097.07, 1054.63, 664.37, 137.49, 326.09, 929.97, 772.13, 456.43, 646.67, 747, 842.53, 411.89, 175.77, 687.86, 165.81, 108.48, 363.87),
label=c(rep("small purchases", 50), rep("large purchases", 50)),
stringsAsFactors=TRUE
)
# fdata is in the workspace
summary(fdata)
## y pred label
## Min. : -5.890 Min. : 1.070 large purchases:50
## 1st Qu.: 5.407 1st Qu.: 6.372 small purchases:50
## Median : 57.375 Median : 55.695
## Mean : 306.203 Mean : 305.904
## 3rd Qu.: 550.900 3rd Qu.: 547.890
## Max. :1101.620 Max. :1098.900
# Examine the data: generate the summaries for the groups large and small:
fdata %>%
group_by(label) %>% # group by small/large purchases
summarize(min = min(y), # min of y
mean = mean(y), # mean of y
max = max(y)) # max of y
## # A tibble: 2 x 4
## label min mean max
## <fct> <dbl> <dbl> <dbl>
## 1 large purchases 96.1 606 1102
## 2 small purchases - 5.89 6.48 18.6
# Fill in the blanks to add error columns
fdata2 <- fdata %>%
group_by(label) %>% # group by label
mutate(residual = pred - y, # Residual
relerr = residual / y) # Relative error
# Compare the rmse and rmse.rel of the large and small groups:
fdata2 %>%
group_by(label) %>%
summarize(rmse = sqrt(mean(residual ** 2)), # RMSE
rmse.rel = sqrt(mean(relerr ** 2))) # Root mean squared relative error
## # A tibble: 2 x 3
## label rmse rmse.rel
## <fct> <dbl> <dbl>
## 1 large purchases 5.54 0.0147
## 2 small purchases 4.02 1.25
# Plot the predictions for both groups of purchases
ggplot(fdata2, aes(x = pred, y = y, color = label)) +
geom_point() +
geom_abline() +
facet_wrap(~ label, ncol = 1, scales = "free") +
ggtitle("Outcome vs prediction")
# In this exercise, you will practice modeling on log-transformed monetary output, and then transforming the "log-money" predictions back into monetary units
# The data loaded into your workspace records subjects' incomes in 2005 (Income2005), as well as the results of several aptitude tests taken by the subjects in 1981:
# Arith
# Word
# Parag
# Math
# AFQT (Percentile on the Armed Forces Qualifying Test)
# The data have already been split into training and test sets (income_train and income_test respectively) and are in the workspace
# You will build a model of log(income) from the inputs, and then convert log(income) back into income.
# Examine Income2005 in the training set (the income data is not available locally, so the code below is left commented out)
# summary(income_train$Income2005)
# Write the formula for log income as a function of the tests and print it
# (fmla.log <- log(Income2005) ~ Arith + Word + Parag + Math + AFQT)
# Fit the linear model
# model.log <- lm(fmla.log, data=income_train)
# Make predictions on income_test
# income_test$logpred <- predict(model.log, newdata=income_test)
# summary(income_test$logpred)
# Convert the predictions to monetary units
# income_test$pred.income <- exp(income_test$logpred)
# summary(income_test$pred.income)
# Plot predicted income (x axis) vs income
# ggplot(income_test, aes(x = pred.income, y = Income2005)) +
# geom_point() +
# geom_abline(color = "blue")
# The income_train and income_test datasets are loaded in your workspace, along with your model, model.log
# Also in the workspace:
# model.abs: a model that directly fits income to the inputs using the formula Income2005 ~ Arith + Word + Parag + Math + AFQT
# fmla.abs is in the workspace
# fmla.abs
# model.abs is in the workspace
# summary(model.abs)
# Add predictions to the test set
# income_test <- income_test %>%
# mutate(pred.absmodel = predict(model.abs, income_test), # predictions from model.abs
# pred.logmodel = exp(predict(model.log, income_test))) # predictions from model.log
# Gather the predictions and calculate residuals and relative error
# income_long <- income_test %>%
# gather(key = modeltype, value = pred, pred.absmodel, pred.logmodel) %>%
# mutate(residual = pred - Income2005, # residuals
# relerr = residual / Income2005) # relative error
# Calculate RMSE and relative RMSE and compare
# income_long %>%
# group_by(modeltype) %>% # group by modeltype
# summarize(rmse = sqrt(mean(residual**2)), # RMSE
# rmse.rel = sqrt(mean(relerr**2))) # Root mean squared relative error
# In this exercise, we will build a model to predict price from a measure of the house's size (surface area)
# The data set houseprice has the columns:
# price : house price in units of $1000
# size: surface area
# A scatterplot of the data shows that the data is quite non-linear: a sort of "hockey-stick" where price is fairly flat for smaller houses, but rises steeply as the house gets larger
# Quadratics and cubics are often good functional forms to express hockey-stick like relationships
# Note that there may not be a "physical" reason that price is related to the square of the size; a quadratic is simply a closed form approximation of the observed relationship
# You will fit a model to predict price as a function of the squared size, and look at its fit on the training data
# Because ^ is also a symbol to express interactions, use the function I() to treat the expression x^2 “as is”: that is, as the square of x rather than the interaction of x with itself
# exampleFormula = y ~ I(x^2)
houseprice <- data.frame(size=c(72, 98, 92, 90, 44, 46, 90, 150, 94, 90, 90, 66, 142, 74, 86, 46, 54, 130, 122, 118, 100, 74, 146, 92, 100, 140, 94, 90, 120, 70, 100, 132, 58, 92, 76, 90, 66, 134, 140, 64),
price=c(156, 153, 230, 152, 42, 157, 113, 573, 202, 261, 175, 212, 486, 109, 220, 186, 133, 360, 283, 380, 185, 186, 459, 167, 171, 547, 170, 286, 293, 109, 205, 514, 175, 249, 234, 242, 177, 399, 511, 107)
)
# The data set houseprice is in the workspace.
# houseprice is in the workspace
summary(houseprice)
## size price
## Min. : 44.0 Min. : 42.0
## 1st Qu.: 73.5 1st Qu.:164.5
## Median : 91.0 Median :203.5
## Mean : 94.3 Mean :249.2
## 3rd Qu.:118.5 3rd Qu.:287.8
## Max. :150.0 Max. :573.0
# Create the formula for price as a function of squared size
(fmla_sqr <- price ~ I(size**2))
## price ~ I(size^2)
# Fit a model of price as a function of squared size (use fmla_sqr)
model_sqr <- lm(fmla_sqr, data=houseprice)
# Fit a model of price as a linear function of size
model_lin <- lm(price ~ size, data=houseprice)
# Make predictions and compare
houseprice %>%
mutate(pred_lin = predict(model_lin), # predictions from linear model
pred_sqr = predict(model_sqr)) %>% # predictions from quadratic model
tidyr::gather(key = modeltype, value = pred, pred_lin, pred_sqr) %>% # gather the predictions
ggplot(aes(x = size)) +
geom_point(aes(y = price)) + # actual prices
geom_line(aes(y = pred, color = modeltype)) + # the predictions
scale_color_brewer(palette = "Dark2")
# Create a splitting plan for 3-fold cross validation
set.seed(34245) # set the seed for reproducibility
splitPlan <- vtreat::kWayCrossValidation(nrow(houseprice), 3, NULL, NULL)
# Sample code: get cross-val predictions for price ~ size
houseprice$pred_lin <- 0 # initialize the prediction vector
for(i in 1:3) {
split <- splitPlan[[i]]
model_lin <- lm(price ~ size, data = houseprice[split$train,])
houseprice$pred_lin[split$app] <- predict(model_lin, newdata = houseprice[split$app,])
}
# Get cross-val predictions for price as a function of size^2 (use fmla_sqr)
houseprice$pred_sqr <- 0 # initialize the prediction vector
for(i in 1:3) {
split <- splitPlan[[i]]
model_sqr <- lm(fmla_sqr, data = houseprice[split$train, ])
houseprice$pred_sqr[split$app] <- predict(model_sqr, newdata = houseprice[split$app, ])
}
# Gather the predictions and calculate the residuals
houseprice_long <- houseprice %>%
tidyr::gather(key = modeltype, value = pred, pred_lin, pred_sqr) %>%
mutate(residuals = pred - price)
# Compare the cross-validated RMSE for the two models
houseprice_long %>%
group_by(modeltype) %>% # group by modeltype
summarize(rmse = sqrt(mean(residuals**2)))
## # A tibble: 2 x 2
## modeltype rmse
## <chr> <dbl>
## 1 pred_lin 74.3
## 2 pred_sqr 63.7
Chapter 4 - Dealing with Non-Linear Responses
Logistic regression for predicting probabilities:
Poisson and quasipoisson regression to predict counts:
GAM to learn non-linear transformations:
Example code includes:
# In this exercise, you will estimate the probability that a sparrow survives a severe winter storm, based on physical characteristics of the sparrow
# The dataset sparrow is loaded into your workspace
# The outcome to be predicted is status ("Survived", "Perished")
# The variables we will consider are:
# total_length: length of the bird from tip of beak to tip of tail (mm)
# weight: in grams
# humerus : length of humerus ("upper arm bone" that connects the wing to the body) (inches)
# Remember that when using glm() to create a logistic regression model, you must explicitly specify that family = binomial: glm(formula, data = data, family = binomial)
# The data frame sparrow and the package broom are loaded in the workspace.
sparrow <- readRDS("./RInputFiles/sparrow.rds")
# sparrow is in the workspace
summary(sparrow)
## status age total_length wingspan
## Perished:36 Length:87 Min. :153.0 Min. :236.0
## Survived:51 Class :character 1st Qu.:158.0 1st Qu.:245.0
## Mode :character Median :160.0 Median :247.0
## Mean :160.4 Mean :247.5
## 3rd Qu.:162.5 3rd Qu.:251.0
## Max. :167.0 Max. :256.0
## weight beak_head humerus femur
## Min. :23.2 Min. :29.80 Min. :0.6600 Min. :0.6500
## 1st Qu.:24.7 1st Qu.:31.40 1st Qu.:0.7250 1st Qu.:0.7000
## Median :25.8 Median :31.70 Median :0.7400 Median :0.7100
## Mean :25.8 Mean :31.64 Mean :0.7353 Mean :0.7134
## 3rd Qu.:26.7 3rd Qu.:32.10 3rd Qu.:0.7500 3rd Qu.:0.7300
## Max. :31.0 Max. :33.00 Max. :0.7800 Max. :0.7600
## legbone skull sternum
## Min. :1.010 Min. :0.5600 Min. :0.7700
## 1st Qu.:1.110 1st Qu.:0.5900 1st Qu.:0.8300
## Median :1.130 Median :0.6000 Median :0.8500
## Mean :1.131 Mean :0.6032 Mean :0.8511
## 3rd Qu.:1.160 3rd Qu.:0.6100 3rd Qu.:0.8800
## Max. :1.230 Max. :0.6400 Max. :0.9300
# Create the survived column
sparrow$survived <- sparrow$status == "Survived"
# Create the formula
(fmla <- survived ~ total_length + weight + humerus)
## survived ~ total_length + weight + humerus
# Fit the logistic regression model
sparrow_model <- glm(fmla, data=sparrow, family="binomial")
# Call summary
summary(sparrow_model)
##
## Call:
## glm(formula = fmla, family = "binomial", data = sparrow)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1117 -0.6026 0.2871 0.6577 1.7082
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 46.8813 16.9631 2.764 0.005715 **
## total_length -0.5435 0.1409 -3.858 0.000115 ***
## weight -0.5689 0.2771 -2.053 0.040060 *
## humerus 75.4610 19.1586 3.939 8.19e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 118.008 on 86 degrees of freedom
## Residual deviance: 75.094 on 83 degrees of freedom
## AIC: 83.094
##
## Number of Fisher Scoring iterations: 5
# Call glance
(perf <- broom::glance(sparrow_model))
## null.deviance df.null logLik AIC BIC deviance df.residual
## 1 118.0084 86 -37.54718 83.09436 92.95799 75.09436 83
# Calculate pseudo-R-squared
(pseudoR2 <- 1 - perf$deviance / perf$null.deviance)
## [1] 0.3636526
# Recall that when calling predict() to get the predicted probabilities from a glm() model, you must specify that you want the response:
# predict(model, type = "response")
# Otherwise, predict() on a logistic regression model returns the predicted log-odds of the event, not the probability.
# You will also use the GainCurvePlot() function to plot the gain curve from the model predictions
# GainCurvePlot(frame, xvar, truthVar, title)
# Make predictions
sparrow$pred <- predict(sparrow_model, type="response")
# Look at gain curve
WVPlots::GainCurvePlot(sparrow, "pred", "survived", "sparrow survival model")
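# Quick check of the link/response relationship (a side note, not part of the exercise):
# the default predict() output for a logistic model is on the log-odds scale, and plogis()
# maps it back to a probability
linkpred <- predict(sparrow_model)   # log-odds of survival
all.equal(as.numeric(plogis(linkpred)),
as.numeric(predict(sparrow_model, type = "response")))
# should be TRUE, since the binomial family uses the logit link by default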
# In this exercise you will build a model to predict the number of bikes rented in an hour as a function of the weather, the type of day (holiday, working day, or weekend), and the time of day
# You will train the model on data from the month of July
# Remember that you must specify family = poisson or family = quasipoisson when using glm() to fit a count model
# Since there are a lot of input variables, for convenience we will specify the outcome and the inputs in variables, and use paste() to assemble a string representing the model formula.
# The data frame bikesJuly is in the workspace
# The names of the outcome variable and the input variables are also in the workspace as the variables outcome and vars respectively
load("./RInputFiles/Bikes.RData")
outcome <- "cnt"
vars <- c("hr", "holiday", "workingday", "weathersit", "temp", "atemp", "hum", "windspeed")
# bikesJuly is in the workspace
str(bikesJuly)
## 'data.frame': 744 obs. of 12 variables:
## $ hr : Factor w/ 24 levels "0","1","2","3",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ holiday : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ workingday: logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ weathersit: chr "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" ...
## $ temp : num 0.76 0.74 0.72 0.72 0.7 0.68 0.7 0.74 0.78 0.82 ...
## $ atemp : num 0.727 0.697 0.697 0.712 0.667 ...
## $ hum : num 0.66 0.7 0.74 0.84 0.79 0.79 0.79 0.7 0.62 0.56 ...
## $ windspeed : num 0 0.1343 0.0896 0.1343 0.194 ...
## $ cnt : int 149 93 90 33 4 10 27 50 142 219 ...
## $ instant : int 13004 13005 13006 13007 13008 13009 13010 13011 13012 13013 ...
## $ mnth : int 7 7 7 7 7 7 7 7 7 7 ...
## $ yr : int 1 1 1 1 1 1 1 1 1 1 ...
# Create the formula string for bikes rented as a function of the inputs
(fmla <- paste(outcome, "~", paste(vars, collapse = " + ")))
## [1] "cnt ~ hr + holiday + workingday + weathersit + temp + atemp + hum + windspeed"
# Calculate the mean and variance of the outcome
(mean_bikes <- mean(bikesJuly$cnt))
## [1] 273.6653
(var_bikes <- var(bikesJuly$cnt))
## [1] 45863.84
# Fit the model; the variance is much larger than the mean, so the counts are overdispersed and quasipoisson is used rather than poisson
bike_model <- glm(fmla, data=bikesJuly, family=quasipoisson)
# Call glance
(perf <- broom::glance(bike_model))
## null.deviance df.null logLik AIC BIC deviance df.residual
## 1 133364.9 743 NA NA NA 28774.9 712
# Calculate pseudo-R-squared
(pseudoR2 <- 1 - perf$deviance / perf$null.deviance)
## [1] 0.7842393
# In this exercise you will use the model you built in the previous exercise to make predictions for the month of August
# The data set bikesAugust has the same columns as bikesJuly
# Recall that you must specify type = "response" with predict() when predicting counts from a glm poisson or quasipoisson model
# bikesAugust is in the workspace
str(bikesAugust)
## 'data.frame': 744 obs. of 12 variables:
## $ hr : Factor w/ 24 levels "0","1","2","3",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ holiday : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ workingday: logi TRUE TRUE TRUE TRUE TRUE TRUE ...
## $ weathersit: chr "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" ...
## $ temp : num 0.68 0.66 0.64 0.64 0.64 0.64 0.64 0.64 0.66 0.68 ...
## $ atemp : num 0.636 0.606 0.576 0.576 0.591 ...
## $ hum : num 0.79 0.83 0.83 0.83 0.78 0.78 0.78 0.83 0.78 0.74 ...
## $ windspeed : num 0.1642 0.0896 0.1045 0.1045 0.1343 ...
## $ cnt : int 47 33 13 7 4 49 185 487 681 350 ...
## $ instant : int 13748 13749 13750 13751 13752 13753 13754 13755 13756 13757 ...
## $ mnth : int 8 8 8 8 8 8 8 8 8 8 ...
## $ yr : int 1 1 1 1 1 1 1 1 1 1 ...
# bike_model is in the workspace
summary(bike_model)
##
## Call:
## glm(formula = fmla, family = quasipoisson, data = bikesJuly)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -21.6117 -4.3121 -0.7223 3.5507 16.5079
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.934986 0.439027 13.519 < 2e-16 ***
## hr1 -0.580055 0.193354 -3.000 0.002794 **
## hr2 -0.892314 0.215452 -4.142 3.86e-05 ***
## hr3 -1.662342 0.290658 -5.719 1.58e-08 ***
## hr4 -2.350204 0.393560 -5.972 3.71e-09 ***
## hr5 -1.084289 0.230130 -4.712 2.96e-06 ***
## hr6 0.211945 0.156476 1.354 0.176012
## hr7 1.211135 0.132332 9.152 < 2e-16 ***
## hr8 1.648361 0.127177 12.961 < 2e-16 ***
## hr9 1.155669 0.133927 8.629 < 2e-16 ***
## hr10 0.993913 0.137096 7.250 1.09e-12 ***
## hr11 1.116547 0.136300 8.192 1.19e-15 ***
## hr12 1.282685 0.134769 9.518 < 2e-16 ***
## hr13 1.273010 0.135872 9.369 < 2e-16 ***
## hr14 1.237721 0.136386 9.075 < 2e-16 ***
## hr15 1.260647 0.136144 9.260 < 2e-16 ***
## hr16 1.515893 0.132727 11.421 < 2e-16 ***
## hr17 1.948404 0.128080 15.212 < 2e-16 ***
## hr18 1.893915 0.127812 14.818 < 2e-16 ***
## hr19 1.669277 0.128471 12.993 < 2e-16 ***
## hr20 1.420732 0.131004 10.845 < 2e-16 ***
## hr21 1.146763 0.134042 8.555 < 2e-16 ***
## hr22 0.856182 0.138982 6.160 1.21e-09 ***
## hr23 0.479197 0.148051 3.237 0.001265 **
## holidayTRUE 0.201598 0.079039 2.551 0.010961 *
## workingdayTRUE 0.116798 0.033510 3.485 0.000521 ***
## weathersitLight Precipitation -0.214801 0.072699 -2.955 0.003233 **
## weathersitMisty -0.010757 0.038600 -0.279 0.780572
## temp -3.246001 1.148270 -2.827 0.004833 **
## atemp 2.042314 0.953772 2.141 0.032589 *
## hum -0.748557 0.236015 -3.172 0.001581 **
## windspeed 0.003277 0.148814 0.022 0.982439
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 38.98949)
##
## Null deviance: 133365 on 743 degrees of freedom
## Residual deviance: 28775 on 712 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 5
# Make predictions on August data
bikesAugust$pred <- predict(bike_model, newdata=bikesAugust, type="response")
# Calculate the RMSE
bikesAugust %>%
mutate(residual = pred - cnt) %>%
summarize(rmse = sqrt(mean(residual**2)))
## rmse
## 1 112.5815
# Plot predictions vs cnt (pred on x-axis)
ggplot(bikesAugust, aes(x = pred, y = cnt)) +
geom_point() +
geom_abline(color = "darkblue")
# In the previous exercise, you visualized the bike model's predictions using the standard "outcome vs. prediction" scatter plot
# Since the bike rental data is time series data, you might be interested in how the model performs as a function of time
# In this exercise, you will compare the predictions and actual rentals on an hourly basis, for the first 14 days of August
# To create the plot you will use the function tidyr::gather() to consolidate the predicted and actual values from bikesAugust in a single column
# gather() takes as arguments:
# The "wide" data frame to be gathered (implicit in a pipe)
# The name of the key column to be created - contains the names of the gathered columns.
# The name of the value column to be created - contains the values of the gathered columns.
# The names of the columns to be gathered into a single column.
# You'll use the gathered data frame to compare the actual and predicted rental counts as a function of time
# The time index, instant counts the number of observations since the beginning of data collection
# The sample code converts the instants to daily units, starting from 0
# The data frame bikesAugust, with the predictions (bikesAugust$pred) is in the workspace.
# Plot predictions and cnt by date/time
quasipoisson_plot <- bikesAugust %>%
# set start to 0, convert unit to days
mutate(instant = (instant - min(instant))/24) %>%
# gather cnt and pred into a value column
tidyr::gather(key = valuetype, value = value, cnt, pred) %>%
filter(instant < 14) %>% # restrict to the first 14 days
# plot value by instant
ggplot(aes(x = instant, y = value, color = valuetype, linetype = valuetype)) +
geom_point() +
geom_line() +
scale_x_continuous("Day", breaks = 0:14, labels = 0:14) +
scale_color_brewer(palette = "Dark2") +
ggtitle("Predicted August bike rentals, Quasipoisson model")
quasipoisson_plot
# In this exercise you will model the average leaf weight on a soybean plant as a function of time (after planting)
# As you will see, the soybean plant doesn't grow at a steady rate, but rather has a "growth spurt" that eventually tapers off
# Hence, leaf weight is not well described by a linear model.
# Recall that you can designate which variable you want to model non-linearly in a formula with the s() function:
# Also remember that gam() from the package mgcv has the calling interface gam(formula, family, data)
# For standard regression, use family = gaussian (the default).
# The soybean training data, soybean_train is loaded into your workspace
# It has two columns: the outcome weight and the variable Time
# For comparison, the linear model model.lin, which was fit using the formula weight ~ Time has already been loaded into the workspace as well
load("./RInputFiles/Soybean.RData")
# soybean_train is in the workspace
summary(soybean_train)
## Plot Variety Year Time weight
## 1988F6 : 10 F:161 1988:124 Min. :14.00 Min. : 0.0290
## 1988F7 : 9 P:169 1989:102 1st Qu.:27.00 1st Qu.: 0.6663
## 1988P1 : 9 1990:104 Median :42.00 Median : 3.5233
## 1988P8 : 9 Mean :43.56 Mean : 6.1645
## 1988P2 : 9 3rd Qu.:56.00 3rd Qu.:10.3808
## 1988F3 : 8 Max. :84.00 Max. :27.3700
## (Other):276
# Plot weight vs Time (Time on x axis)
ggplot(soybean_train, aes(x = Time, y = weight)) +
geom_point()
# Create the formula
(fmla.gam <- weight ~ s(Time) )
## weight ~ s(Time)
# Fit the GAM Model
model.gam <- mgcv::gam(fmla.gam, data=soybean_train, family="gaussian")
# Call summary() on model.lin and look for R-squared
model.lin <- lm(weight ~ Time, data=soybean_train)
summary(model.lin)
##
## Call:
## lm(formula = weight ~ Time, data = soybean_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3933 -1.7100 -0.3909 1.9056 11.4381
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.559283 0.358527 -18.30 <2e-16 ***
## Time 0.292094 0.007444 39.24 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.778 on 328 degrees of freedom
## Multiple R-squared: 0.8244, Adjusted R-squared: 0.8238
## F-statistic: 1540 on 1 and 328 DF, p-value: < 2.2e-16
# Call summary() on model.gam and look for R-squared
summary(model.gam)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## weight ~ s(Time)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.1645 0.1143 53.93 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(Time) 8.495 8.93 338.2 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.902 Deviance explained = 90.4%
## GCV = 4.4395 Scale est. = 4.3117 n = 330
# Call plot() on model.gam
plot(model.gam)
# The data frame soybean.test and the models model.lin and model.gam are in the workspace
# soybean_test is in the workspace
summary(soybean_test)
## Plot Variety Year Time weight
## 1988F8 : 4 F:43 1988:32 Min. :14.00 Min. : 0.0380
## 1988P7 : 4 P:39 1989:26 1st Qu.:23.00 1st Qu.: 0.4248
## 1989F8 : 4 1990:24 Median :41.00 Median : 3.0025
## 1990F8 : 4 Mean :44.09 Mean : 7.1576
## 1988F4 : 3 3rd Qu.:69.00 3rd Qu.:15.0113
## 1988F2 : 3 Max. :84.00 Max. :30.2717
## (Other):60
# Get predictions from linear model
soybean_test$pred.lin <- predict(model.lin, newdata = soybean_test)
# Get predictions from gam model
soybean_test$pred.gam <- as.numeric(predict(model.gam, newdata = soybean_test))
# Gather the predictions into a "long" dataset
soybean_long <- soybean_test %>%
tidyr::gather(key = modeltype, value = pred, pred.lin, pred.gam)
# Calculate the rmse
soybean_long %>%
mutate(residual = weight - pred) %>% # residuals
group_by(modeltype) %>% # group by modeltype
summarize(rmse = sqrt(mean(residual**2))) # calculate the RMSE
## # A tibble: 2 x 2
## modeltype rmse
## <chr> <dbl>
## 1 pred.gam 2.29
## 2 pred.lin 3.19
# Compare the predictions against actual weights on the test data
soybean_long %>%
ggplot(aes(x = Time)) + # the column for the x axis
geom_point(aes(y = weight)) + # the y-column for the scatterplot
geom_point(aes(y = pred, color = modeltype)) + # the y-column for the point-and-line plot
geom_line(aes(y = pred, color = modeltype, linetype = modeltype)) + # the y-column for the point-and-line plot
scale_color_brewer(palette = "Dark2")
Chapter 5 - Tree Based Models
The intuition behind tree-based models:
Random forests:
One-Hot-Encoding categorical variables:
Gradient boosting machines:
Example code includes:
# Since there are a lot of input variables, for convenience we will specify the outcome and the inputs in the variables outcome and vars, and use paste() to assemble a string representing the model formula.
# The data frame bikesJuly is in the workspace. The sample code specifies the names of the outcome and input variables.
# bikesJuly is in the workspace
str(bikesJuly)
## 'data.frame': 744 obs. of 12 variables:
## $ hr : Factor w/ 24 levels "0","1","2","3",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ holiday : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ workingday: logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ weathersit: chr "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" ...
## $ temp : num 0.76 0.74 0.72 0.72 0.7 0.68 0.7 0.74 0.78 0.82 ...
## $ atemp : num 0.727 0.697 0.697 0.712 0.667 ...
## $ hum : num 0.66 0.7 0.74 0.84 0.79 0.79 0.79 0.7 0.62 0.56 ...
## $ windspeed : num 0 0.1343 0.0896 0.1343 0.194 ...
## $ cnt : int 149 93 90 33 4 10 27 50 142 219 ...
## $ instant : int 13004 13005 13006 13007 13008 13009 13010 13011 13012 13013 ...
## $ mnth : int 7 7 7 7 7 7 7 7 7 7 ...
## $ yr : int 1 1 1 1 1 1 1 1 1 1 ...
# Random seed to reproduce results
seed <- 1804240829
# The outcome column
(outcome <- "cnt")
## [1] "cnt"
# The input variables
(vars <- c("hr", "holiday", "workingday", "weathersit", "temp", "atemp", "hum", "windspeed"))
## [1] "hr" "holiday" "workingday" "weathersit" "temp"
## [6] "atemp" "hum" "windspeed"
# Create the formula string for bikes rented as a function of the inputs
(fmla <- paste(outcome, "~", paste(vars, collapse = " + ")))
## [1] "cnt ~ hr + holiday + workingday + weathersit + temp + atemp + hum + windspeed"
# Load the package ranger
library(ranger)
# Fit and print the random forest model
(bike_model_rf <- ranger(fmla, # formula
bikesJuly, # data
num.trees = 500,
respect.unordered.factors = "order",
seed = seed))
## Ranger result
##
## Call:
## ranger(fmla, bikesJuly, num.trees = 500, respect.unordered.factors = "order", seed = seed)
##
## Type: Regression
## Number of trees: 500
## Sample size: 744
## Number of independent variables: 8
## Mtry: 2
## Target node size: 5
## Variable importance mode: none
## OOB prediction error (MSE): 8298.542
## R squared (OOB): 0.8190613
# In this exercise you will use the model that you fit in the previous exercise to predict bike rentals for the month of August.
# The predict() function for a ranger model produces a list
# One of the elements of this list is predictions, a vector of predicted values
# You can access predictions with the $ notation for accessing named elements of a list: predict(model, data)$predictions
# The model bike_model_rf and the dataset bikesAugust (for evaluation) are loaded into your workspace.
# bikesAugust is in the workspace
str(bikesAugust)
## 'data.frame': 744 obs. of 13 variables:
## $ hr : Factor w/ 24 levels "0","1","2","3",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ holiday : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ workingday: logi TRUE TRUE TRUE TRUE TRUE TRUE ...
## $ weathersit: chr "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" ...
## $ temp : num 0.68 0.66 0.64 0.64 0.64 0.64 0.64 0.64 0.66 0.68 ...
## $ atemp : num 0.636 0.606 0.576 0.576 0.591 ...
## $ hum : num 0.79 0.83 0.83 0.83 0.78 0.78 0.78 0.83 0.78 0.74 ...
## $ windspeed : num 0.1642 0.0896 0.1045 0.1045 0.1343 ...
## $ cnt : int 47 33 13 7 4 49 185 487 681 350 ...
## $ instant : int 13748 13749 13750 13751 13752 13753 13754 13755 13756 13757 ...
## $ mnth : int 8 8 8 8 8 8 8 8 8 8 ...
## $ yr : int 1 1 1 1 1 1 1 1 1 1 ...
## $ pred : num 94.96 51.74 37.98 17.58 9.36 ...
# bike_model_rf is in the workspace
bike_model_rf
## Ranger result
##
## Call:
## ranger(fmla, bikesJuly, num.trees = 500, respect.unordered.factors = "order", seed = seed)
##
## Type: Regression
## Number of trees: 500
## Sample size: 744
## Number of independent variables: 8
## Mtry: 2
## Target node size: 5
## Variable importance mode: none
## OOB prediction error (MSE): 8298.542
## R squared (OOB): 0.8190613
# Make predictions on the August data
bikesAugust$pred <- predict(bike_model_rf, bikesAugust)$predictions
# Calculate the RMSE of the predictions
bikesAugust %>%
mutate(residual = cnt - pred) %>% # calculate the residual
summarize(rmse = sqrt(mean(residual**2))) # calculate rmse
## rmse
## 1 96.73917
# Plot actual outcome vs predictions (predictions on x-axis)
ggplot(bikesAugust, aes(x = pred, y = cnt)) +
geom_point() +
geom_abline()
# The data frame bikesAugust (with predictions) is in the workspace
# The plot quasipoisson_plot of quasipoisson model predictions as a function of time is also in the workspace
# Print quasipoisson_plot
plot(quasipoisson_plot)
# Plot predictions and cnt by date/time
randomforest_plot <- bikesAugust %>%
mutate(instant = (instant - min(instant))/24) %>% # set start to 0, convert unit to days
tidyr::gather(key = valuetype, value = value, cnt, pred) %>%
filter(instant < 14) %>% # first two weeks
ggplot(aes(x = instant, y = value, color = valuetype, linetype = valuetype)) +
geom_point() +
geom_line() +
scale_x_continuous("Day", breaks = 0:14, labels = 0:14) +
scale_color_brewer(palette = "Dark2") +
ggtitle("Predicted August bike rentals, Random Forest plot")
randomforest_plot
# In this exercise you will use vtreat to one-hot-encode a categorical variable on a small example
# vtreat creates a treatment plan to transform categorical variables into indicator variables (coded "lev"), and to clean bad values out of numerical variables (coded "clean").
# To design a treatment plan use the function designTreatmentsZ()
# treatplan <- designTreatmentsZ(data, varlist)
# data: the original training data frame
# varlist: a vector of input variables to be treated (as strings)
# designTreatmentsZ() returns a list with an element scoreFrame: a data frame that includes the names and types of the new variables:
# scoreFrame <- treatplan %>% magrittr::use_series(scoreFrame) %>% select(varName, origName, code)
# varName: the name of the new treated variable
# origName: the name of the original variable that the treated variable comes from
# code: the type of the new variable.
# "clean": a numerical variable with no NAs or NaNs
# "lev": an indicator variable for a specific level of the original categorical variable
# (magrittr::use_series() is an alias for $ that you can use in pipes.)
# For these exercises, we want varName where code is either "clean" or "lev"
# newvarlist <- scoreFrame %>% filter(code %in% c("clean", "lev")) %>% magrittr::use_series(varName)
# To transform the data set into all numerical and one-hot-encoded variables, use prepare(): data.treat <- prepare(treatplan, data, varRestriction = newvarlist)
# treatplan: the treatment plan
# data: the data frame to be treated
# varRestriction: the variables desired in the treated data
# The data frame dframe and the package magrittr are loaded in the workspace.
# dframe is in the workspace
dframe <- data.frame(color=c('b', 'r', 'r', 'r', 'r', 'b', 'r', 'g', 'b', 'b'),
size=c(13, 11, 15, 14, 13, 11, 9, 12, 7, 12),
popularity=c(1.079, 1.396, 0.922, 1.203, 1.084, 0.804, 1.104, 0.875, 0.695, 0.883),
stringsAsFactors=TRUE
)
dframe
## color size popularity
## 1 b 13 1.079
## 2 r 11 1.396
## 3 r 15 0.922
## 4 r 14 1.203
## 5 r 13 1.084
## 6 b 11 0.804
## 7 r 9 1.104
## 8 g 12 0.875
## 9 b 7 0.695
## 10 b 12 0.883
# Create and print a vector of variable names
(vars <- c("color", "size"))
## [1] "color" "size"
# Load the package vtreat
# library(vtreat)
# Create the treatment plan
treatplan <- vtreat::designTreatmentsZ(dframe, vars)
## [1] "designing treatments Fri Apr 27 07:37:17 2018"
## [1] "designing treatments Fri Apr 27 07:37:17 2018"
## [1] " have level statistics Fri Apr 27 07:37:17 2018"
## [1] "design var color Fri Apr 27 07:37:17 2018"
## [1] "design var size Fri Apr 27 07:37:17 2018"
## [1] " scoring treatments Fri Apr 27 07:37:17 2018"
## [1] "have treatment plan Fri Apr 27 07:37:17 2018"
# Examine the scoreFrame
(scoreFrame <- treatplan %>%
use_series(scoreFrame) %>%
select(varName, origName, code))
## varName origName code
## 1 color_lev_x.b color lev
## 2 color_lev_x.g color lev
## 3 color_lev_x.r color lev
## 4 color_catP color catP
## 5 size_clean size clean
# We only want the rows with codes "clean" or "lev"
(newvars <- scoreFrame %>%
filter(code %in% c("clean", "lev")) %>%
use_series(varName))
## [1] "color_lev_x.b" "color_lev_x.g" "color_lev_x.r" "size_clean"
# Create the treated training data
(dframe.treat <- vtreat::prepare(treatplan, dframe, varRestriction = newvars))
## color_lev_x.b color_lev_x.g color_lev_x.r size_clean
## 1 1 0 0 13
## 2 0 0 1 11
## 3 0 0 1 15
## 4 0 0 1 14
## 5 0 0 1 13
## 6 1 0 0 11
## 7 0 0 1 9
## 8 0 1 0 12
## 9 1 0 0 7
## 10 1 0 0 12
# When a level of a categorical variable is rare, sometimes it will fail to show up in training data
# If that rare level then appears in future data, downstream models may not know what to do with it
# When such novel levels appear, using model.matrix or caret::dummyVars to one-hot-encode will not work correctly.
# vtreat is a "safer" alternative to model.matrix for one-hot-encoding, because it can manage novel levels safely
# vtreat also manages missing values in the data (both categorical and continuous); a short sketch of both points follows this note
# In this exercise you will see how vtreat handles categorical values that did not appear in the training set
# The treatment plan treatplan and the set of variables newvars from the previous exercise are still in your workspace
# dframe and a new data frame testframe are also in your workspace
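# A small sketch of the two points above (my own illustration, not part of the course; the toy frames here are hypothetical)
# model.matrix keys its indicator columns to whatever levels happen to be present, so training and test matrices need not align:
# model.matrix(~ color - 1, data.frame(color = c("b", "r")))   # columns colorb, colorr
# model.matrix(~ color - 1, data.frame(color = c("g", "y")))   # columns colorg, colory
# By contrast, prepare() always emits the columns defined by the treatment plan, and a missing numeric value is imputed rather than propagated as NA, e.g.
# vtreat::prepare(treatplan, data.frame(color = "y", size = NA_real_), varRestriction = newvars)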
# treatplan is in the workspace
summary(treatplan)
## Length Class Mode
## treatments 3 -none- list
## scoreFrame 8 data.frame list
## outcomename 1 -none- character
## vtreatVersion 1 package_version list
## outcomeType 1 -none- character
## outcomeTarget 1 -none- character
## meanY 1 -none- logical
## splitmethod 1 -none- character
# newvars is in the workspace
(newvars <- c('color_lev_x.b', 'color_lev_x.g', 'color_lev_x.r', 'size_clean'))
## [1] "color_lev_x.b" "color_lev_x.g" "color_lev_x.r" "size_clean"
# Create and print testframe
testframe <- data.frame(color=c('g', 'g', 'y', 'g', 'g', 'y', 'b', 'g', 'g', 'r'),
size=c(7, 8, 10, 12, 6, 8, 12, 12, 12, 8),
popularity=c(0.973, 0.912, 1.422, 1.191, 0.987, 1.37, 1.096, 0.916, 1, 1.314),
stringsAsFactors=TRUE
)
testframe
## color size popularity
## 1 g 7 0.973
## 2 g 8 0.912
## 3 y 10 1.422
## 4 g 12 1.191
## 5 g 6 0.987
## 6 y 8 1.370
## 7 b 12 1.096
## 8 g 12 0.916
## 9 g 12 1.000
## 10 r 8 1.314
# Use prepare() to one-hot-encode testframe
(testframe.treat <- vtreat::prepare(treatplan, testframe, varRestriction = newvars))
## color_lev_x.b color_lev_x.g color_lev_x.r size_clean
## 1 0 1 0 7
## 2 0 1 0 8
## 3 0 0 0 10
## 4 0 1 0 12
## 5 0 1 0 6
## 6 0 0 0 8
## 7 1 0 0 12
## 8 0 1 0 12
## 9 0 1 0 12
## 10 0 0 1 8
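# Note that rows 3 and 6 of testframe have color = 'y', a level that never appeared in the training data:
# vtreat encodes them with all color indicator columns set to 0 rather than failing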
# The outcome column
(outcome <- "cnt")
## [1] "cnt"
# The input columns
(vars <- c("hr", "holiday", "workingday", "weathersit", "temp", "atemp", "hum", "windspeed"))
## [1] "hr" "holiday" "workingday" "weathersit" "temp"
## [6] "atemp" "hum" "windspeed"
# Create the treatment plan from bikesJuly (the training data)
treatplan <- vtreat::designTreatmentsZ(bikesJuly, vars, verbose = FALSE)
# Get the "clean" and "lev" variables from the scoreFrame
(newvars <- treatplan %>%
use_series(scoreFrame) %>%
filter(code %in% c("clean", "lev")) %>% # get the rows you care about
use_series(varName)) # get the varName column
## [1] "hr_lev_x.0"
## [2] "hr_lev_x.1"
## [3] "hr_lev_x.10"
## [4] "hr_lev_x.11"
## [5] "hr_lev_x.12"
## [6] "hr_lev_x.13"
## [7] "hr_lev_x.14"
## [8] "hr_lev_x.15"
## [9] "hr_lev_x.16"
## [10] "hr_lev_x.17"
## [11] "hr_lev_x.18"
## [12] "hr_lev_x.19"
## [13] "hr_lev_x.2"
## [14] "hr_lev_x.20"
## [15] "hr_lev_x.21"
## [16] "hr_lev_x.22"
## [17] "hr_lev_x.23"
## [18] "hr_lev_x.3"
## [19] "hr_lev_x.4"
## [20] "hr_lev_x.5"
## [21] "hr_lev_x.6"
## [22] "hr_lev_x.7"
## [23] "hr_lev_x.8"
## [24] "hr_lev_x.9"
## [25] "holiday_clean"
## [26] "workingday_clean"
## [27] "weathersit_lev_x.Clear.to.partly.cloudy"
## [28] "weathersit_lev_x.Light.Precipitation"
## [29] "weathersit_lev_x.Misty"
## [30] "temp_clean"
## [31] "atemp_clean"
## [32] "hum_clean"
## [33] "windspeed_clean"
# Prepare the training data
bikesJuly.treat <- vtreat::prepare(treatplan, bikesJuly, varRestriction = newvars)
# Prepare the test data
bikesAugust.treat <- vtreat::prepare(treatplan, bikesAugust, varRestriction = newvars)
# Call str() on the treated data
str(bikesJuly.treat)
## 'data.frame': 744 obs. of 33 variables:
## $ hr_lev_x.0 : num 1 0 0 0 0 0 0 0 0 0 ...
## $ hr_lev_x.1 : num 0 1 0 0 0 0 0 0 0 0 ...
## $ hr_lev_x.10 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ hr_lev_x.11 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ hr_lev_x.12 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ hr_lev_x.13 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ hr_lev_x.14 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ hr_lev_x.15 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ hr_lev_x.16 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ hr_lev_x.17 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ hr_lev_x.18 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ hr_lev_x.19 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ hr_lev_x.2 : num 0 0 1 0 0 0 0 0 0 0 ...
## $ hr_lev_x.20 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ hr_lev_x.21 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ hr_lev_x.22 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ hr_lev_x.23 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ hr_lev_x.3 : num 0 0 0 1 0 0 0 0 0 0 ...
## $ hr_lev_x.4 : num 0 0 0 0 1 0 0 0 0 0 ...
## $ hr_lev_x.5 : num 0 0 0 0 0 1 0 0 0 0 ...
## $ hr_lev_x.6 : num 0 0 0 0 0 0 1 0 0 0 ...
## $ hr_lev_x.7 : num 0 0 0 0 0 0 0 1 0 0 ...
## $ hr_lev_x.8 : num 0 0 0 0 0 0 0 0 1 0 ...
## $ hr_lev_x.9 : num 0 0 0 0 0 0 0 0 0 1 ...
## $ holiday_clean : num 0 0 0 0 0 0 0 0 0 0 ...
## $ workingday_clean : num 0 0 0 0 0 0 0 0 0 0 ...
## $ weathersit_lev_x.Clear.to.partly.cloudy: num 1 1 1 1 1 1 1 1 1 1 ...
## $ weathersit_lev_x.Light.Precipitation : num 0 0 0 0 0 0 0 0 0 0 ...
## $ weathersit_lev_x.Misty : num 0 0 0 0 0 0 0 0 0 0 ...
## $ temp_clean : num 0.76 0.74 0.72 0.72 0.7 0.68 0.7 0.74 0.78 0.82 ...
## $ atemp_clean : num 0.727 0.697 0.697 0.712 0.667 ...
## $ hum_clean : num 0.66 0.7 0.74 0.84 0.79 0.79 0.79 0.7 0.62 0.56 ...
## $ windspeed_clean : num 0 0.1343 0.0896 0.1343 0.194 ...
str(bikesAugust.treat)
## 'data.frame': 744 obs. of 33 variables:
## $ hr_lev_x.0 : num 1 0 0 0 0 0 0 0 0 0 ...
## $ hr_lev_x.1 : num 0 1 0 0 0 0 0 0 0 0 ...
## $ hr_lev_x.10 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ hr_lev_x.11 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ hr_lev_x.12 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ hr_lev_x.13 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ hr_lev_x.14 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ hr_lev_x.15 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ hr_lev_x.16 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ hr_lev_x.17 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ hr_lev_x.18 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ hr_lev_x.19 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ hr_lev_x.2 : num 0 0 1 0 0 0 0 0 0 0 ...
## $ hr_lev_x.20 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ hr_lev_x.21 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ hr_lev_x.22 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ hr_lev_x.23 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ hr_lev_x.3 : num 0 0 0 1 0 0 0 0 0 0 ...
## $ hr_lev_x.4 : num 0 0 0 0 1 0 0 0 0 0 ...
## $ hr_lev_x.5 : num 0 0 0 0 0 1 0 0 0 0 ...
## $ hr_lev_x.6 : num 0 0 0 0 0 0 1 0 0 0 ...
## $ hr_lev_x.7 : num 0 0 0 0 0 0 0 1 0 0 ...
## $ hr_lev_x.8 : num 0 0 0 0 0 0 0 0 1 0 ...
## $ hr_lev_x.9 : num 0 0 0 0 0 0 0 0 0 1 ...
## $ holiday_clean : num 0 0 0 0 0 0 0 0 0 0 ...
## $ workingday_clean : num 1 1 1 1 1 1 1 1 1 1 ...
## $ weathersit_lev_x.Clear.to.partly.cloudy: num 1 1 1 1 0 0 1 0 0 0 ...
## $ weathersit_lev_x.Light.Precipitation : num 0 0 0 0 0 0 0 0 0 0 ...
## $ weathersit_lev_x.Misty : num 0 0 0 0 1 1 0 1 1 1 ...
## $ temp_clean : num 0.68 0.66 0.64 0.64 0.64 0.64 0.64 0.64 0.66 0.68 ...
## $ atemp_clean : num 0.636 0.606 0.576 0.576 0.591 ...
## $ hum_clean : num 0.79 0.83 0.83 0.83 0.78 0.78 0.78 0.83 0.78 0.74 ...
## $ windspeed_clean : num 0.1642 0.0896 0.1045 0.1045 0.1343 ...
# In this exercise you will get ready to build a gradient boosting model to predict the number of bikes rented in an hour as a function of the weather and the type and time of day
# You will train the model on data from the month of July.
# The July data is loaded into your workspace
# Remember that bikesJuly.treat no longer has the outcome column, so you must get it from the untreated data: bikesJuly$cnt
# You will use the xgboost package to fit the gradient boosting model
# The function xgb.cv() uses cross-validation to estimate the out-of-sample learning error as each new tree is added to the model
# The appropriate number of trees to use in the final model is the number that minimizes the holdout RMSE
# For this exercise, the key arguments to the xgb.cv() call are:
# data: a numeric matrix.
# label: vector of outcomes (also numeric).
# nrounds: the maximum number of rounds (trees to build).
# nfold: the number of folds for the cross-validation. 5 is a good number.
# objective: "reg:linear" for continuous outcomes.
# eta: the learning rate.
# max_depth: depth of trees.
# early_stopping_rounds: after this many rounds without improvement, stop.
# verbose: 0 to stay silent
# The data frames bikesJuly and bikesJuly.treat are in the workspace
# Load the package xgboost
library(xgboost)
##
## Attaching package: 'xgboost'
## The following object is masked from 'package:dplyr':
##
## slice
# Run xgb.cv
cv <- xgb.cv(data = as.matrix(bikesJuly.treat),
label = bikesJuly$cnt,
nrounds = 100,
nfold = 5,
objective = "reg:linear",
eta = 0.3,
max_depth = 6,
early_stopping_rounds = 10,
verbose = 0 # silent
)
# Get the evaluation log
elog <- cv$evaluation_log
# Determine and print how many trees minimize training and test error
elog %>%
summarize(ntrees.train = which.min(train_rmse_mean), # find the index of min(train_rmse_mean)
ntrees.test = which.min(test_rmse_mean)) # find the index of min(test_rmse_mean)
## ntrees.train ntrees.test
## 1 77 67
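# An aside (not from the course): rather than hard-coding the number of trees in the next step,
# you could carry the cross-validated optimum forward directly, e.g.
# ntrees <- elog$iter[which.min(elog$test_rmse_mean)]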
# In this exercise you will fit a gradient boosting model using xgboost() to predict the number of bikes rented in an hour as a function of the weather and the type and time of day
# You will train the model on data from the month of July and predict on data for the month of August
# The datasets for July and August are loaded into your workspace
# Remember the vtreat-ed data no longer has the outcome column, so you must get it from the original data (the cnt column)
# For convenience, the number of trees to use, ntrees, from the previous exercise is in the workspace
# The data frames bikesJuly, bikesJuly.treat, bikesAugust and bikesAugust.treat are in the workspace. The number of trees ntrees (84, from the DataCamp cross-validation run; the local run above found 67) is in the workspace
# The number of trees to use, as determined by xgb.cv
(ntrees <- 84)
## [1] 84
# Run xgboost
bike_model_xgb <- xgboost(data = as.matrix(bikesJuly.treat), # training data as matrix
label = bikesJuly$cnt, # column of outcomes
nrounds = ntrees, # number of trees to build
objective = "reg:linear", # objective
eta = 0.3,
max_depth = 6,
verbose = 0 # silent
)
# Make predictions
bikesAugust$pred <- predict(bike_model_xgb, as.matrix(bikesAugust.treat))
# Plot predictions (on x axis) vs actual bike rental count
ggplot(bikesAugust, aes(x = pred, y = cnt)) +
geom_point() +
geom_abline()
# bikesAugust is in the workspace
str(bikesAugust)
## 'data.frame': 744 obs. of 13 variables:
## $ hr : Factor w/ 24 levels "0","1","2","3",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ holiday : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ workingday: logi TRUE TRUE TRUE TRUE TRUE TRUE ...
## $ weathersit: chr "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" ...
## $ temp : num 0.68 0.66 0.64 0.64 0.64 0.64 0.64 0.64 0.66 0.68 ...
## $ atemp : num 0.636 0.606 0.576 0.576 0.591 ...
## $ hum : num 0.79 0.83 0.83 0.83 0.78 0.78 0.78 0.83 0.78 0.74 ...
## $ windspeed : num 0.1642 0.0896 0.1045 0.1045 0.1343 ...
## $ cnt : int 47 33 13 7 4 49 185 487 681 350 ...
## $ instant : int 13748 13749 13750 13751 13752 13753 13754 13755 13756 13757 ...
## $ mnth : int 8 8 8 8 8 8 8 8 8 8 ...
## $ yr : int 1 1 1 1 1 1 1 1 1 1 ...
## $ pred : num 48.548 35.349 0.625 -6.652 3.563 ...
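# Note that, unlike the quasipoisson model, nothing constrains these predictions to be non-negative counts;
# the str() output above shows a predicted value of -6.652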
# Calculate RMSE
bikesAugust %>%
mutate(residuals = cnt - pred) %>%
summarize(rmse = sqrt(mean(residuals**2)))
## rmse
## 1 76.36407
# Print quasipoisson_plot
quasipoisson_plot
# Print randomforest_plot
randomforest_plot
# Plot predictions and actual bike rentals as a function of time (days)
bikesAugust %>%
mutate(instant = (instant - min(instant))/24) %>% # set start to 0, convert unit to days
tidyr::gather(key = valuetype, value = value, cnt, pred) %>%
filter(instant < 14) %>% # first two weeks
ggplot(aes(x = instant, y = value, color = valuetype, linetype = valuetype)) +
geom_point() +
geom_line() +
scale_x_continuous("Day", breaks = 0:14, labels = 0:14) +
scale_color_brewer(palette = "Dark2") +
ggtitle("Predicted August bike rentals, Gradient Boosting model")
Chapter 1 - Classification Trees
Overview - supervised learning process using classification trees:
Introduction to classification trees:
Overview of the modeling process:
Evaluating classification model performance:
Splitting criteria in trees:
Example code includes:
# Let's get started and build our first classification tree
# A classification tree is a decision tree that performs a classification (vs regression) task
# You will train a decision tree model to understand which loan applications are at higher risk of default using a subset of the German Credit Dataset
# The response variable, called "default", indicates whether the loan went into a default or not, which means this is a binary classification problem (there are just two classes)
# You will use the rpart package to fit the decision tree and the rpart.plot package to visualize the tree
# The data frame creditsub is in the workspace
# This data frame is a subset of the original German Credit Dataset, which we will use to train our first classification tree model
credit <- read.csv("./RInputFiles/credit.csv")
creditsub <- credit %>%
select(months_loan_duration, percent_of_income, years_at_residence, age, default)
# Look at the data
str(creditsub, give.attr=FALSE)
## 'data.frame': 1000 obs. of 5 variables:
## $ months_loan_duration: int 6 48 12 42 24 36 24 36 12 30 ...
## $ percent_of_income : int 4 2 2 2 3 2 3 2 2 4 ...
## $ years_at_residence : int 4 2 3 4 4 4 4 2 4 2 ...
## $ age : int 67 22 49 45 53 35 53 35 61 28 ...
## $ default : Factor w/ 2 levels "no","yes": 1 2 1 1 2 1 1 1 1 2 ...
# Create the model
credit_model <- rpart::rpart(formula = default ~ .,
data = creditsub,
method = "class")
# Display the results
rpart.plot::rpart.plot(x = credit_model, yesno = 2, type = 0, extra = 0)
# For this exercise, you'll randomly split the German Credit Dataset into two pieces: a training set (80%) called credit_train and a test set (20%) that we will call credit_test
# We'll use these two sets throughout the chapter.
# The credit data frame is loaded into the workspace.
# Total number of rows in the credit data frame
n <- nrow(credit)
# Number of rows for the training set (80% of the dataset)
n_train <- round(0.8 * n)
# Create a vector of indices which is an 80% random sample
set.seed(123)
train_indices <- sample(1:n, n_train)
# Subset the credit data frame to training indices only
credit_train <- credit[train_indices, ]
# Exclude the training indices to create the test set
credit_test <- credit[-train_indices, ]
# Train the model (to predict 'default')
credit_model <- rpart::rpart(formula = default ~ .,
data = credit_train,
method = "class")
# Look at the model output
print(credit_model)
## n= 800
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 800 238 no (0.70250000 0.29750000)
## 2) checking_balance=> 200 DM,unknown 369 45 no (0.87804878 0.12195122) *
## 3) checking_balance=< 0 DM,1 - 200 DM 431 193 no (0.55220418 0.44779582)
## 6) months_loan_duration< 20.5 231 84 no (0.63636364 0.36363636)
## 12) credit_history=critical,good,poor 207 66 no (0.68115942 0.31884058)
## 24) amount< 7341 200 60 no (0.70000000 0.30000000) *
## 25) amount>=7341 7 1 yes (0.14285714 0.85714286) *
## 13) credit_history=perfect,very good 24 6 yes (0.25000000 0.75000000) *
## 7) months_loan_duration>=20.5 200 91 yes (0.45500000 0.54500000)
## 14) savings_balance=> 1000 DM,unknown 35 9 no (0.74285714 0.25714286)
## 28) amount>=2079 26 2 no (0.92307692 0.07692308) *
## 29) amount< 2079 9 2 yes (0.22222222 0.77777778) *
## 15) savings_balance=< 100 DM,100 - 500 DM,500 - 1000 DM 165 65 yes (0.39393939 0.60606061)
## 30) months_loan_duration< 47.5 132 60 yes (0.45454545 0.54545455)
## 60) age>=29.5 77 35 no (0.54545455 0.45454545)
## 120) amount>=2249 62 24 no (0.61290323 0.38709677)
## 240) credit_history=critical,poor,very good 25 5 no (0.80000000 0.20000000) *
## 241) credit_history=good,perfect 37 18 yes (0.48648649 0.51351351)
## 482) age< 41 21 7 no (0.66666667 0.33333333) *
## 483) age>=41 16 4 yes (0.25000000 0.75000000) *
## 121) amount< 2249 15 4 yes (0.26666667 0.73333333) *
## 61) age< 29.5 55 18 yes (0.32727273 0.67272727)
## 122) months_loan_duration< 31.5 38 16 yes (0.42105263 0.57894737)
## 244) amount>=3415 17 6 no (0.64705882 0.35294118) *
## 245) amount< 3415 21 5 yes (0.23809524 0.76190476) *
## 123) months_loan_duration>=31.5 17 2 yes (0.11764706 0.88235294) *
## 31) months_loan_duration>=47.5 33 5 yes (0.15151515 0.84848485) *
# Generate predicted classes using the model object
class_prediction <- predict(object = credit_model,
newdata = credit_test,
type = "class")
# Calculate the confusion matrix for the test set
caret::confusionMatrix(data = class_prediction,
reference = credit_test$default)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 125 46
## yes 13 16
##
## Accuracy : 0.705
## 95% CI : (0.6366, 0.7672)
## No Information Rate : 0.69
## P-Value [Acc > NIR] : 0.3543
##
## Kappa : 0.192
## Mcnemar's Test P-Value : 3.099e-05
##
## Sensitivity : 0.9058
## Specificity : 0.2581
## Pos Pred Value : 0.7310
## Neg Pred Value : 0.5517
## Prevalence : 0.6900
## Detection Rate : 0.6250
## Detection Prevalence : 0.8550
## Balanced Accuracy : 0.5819
##
## 'Positive' Class : no
##
# Train two models that use a different splitting criterion and use the validation set to choose a "best" model from this group
# To do this you'll use the parms argument of the rpart() function
# This argument takes a named list that contains values of different parameters you can use to change how the model is trained
# Set the parameter split to control the splitting criterion
# The datasets credit_test and credit_train have already been loaded for you
# Train a gini-based model
credit_model1 <- rpart::rpart(formula = default ~ .,
data = credit_train,
method = "class",
parms = list(split = "gini"))
# Train an information-based model
credit_model2 <- rpart::rpart(formula = default ~ .,
data = credit_train,
method = "class",
parms = list(split = "information"))
# Generate predictions on the validation set using the gini model
pred1 <- predict(object = credit_model1,
newdata = credit_test,
type = "class")
# Generate predictions on the validation set using the information model
pred2 <- predict(object = credit_model2,
newdata = credit_test,
type = "class")
# Compare classification error
Metrics::ce(actual = credit_test$default,
predicted = pred1)
## [1] 0.295
Metrics::ce(actual = credit_test$default,
predicted = pred2)
## [1] 0.275
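# On this particular train/test split, the information-based model has the lower classification error
# (0.275 vs 0.295 for the gini-based model)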
Chapter 2 - Regression Trees
Introduction to regression trees:
Performance metrics for regression:
Hyper-parameters for a decision tree:
Grid search for model selection:
Example code includes:
# These examples will use a subset of the Student Performance Dataset from the UCI Machine Learning Repository
# The goal of this exercise is to predict a student's final Mathematics grade based on the following variables:
# sex, age, address, studytime (weekly study time), schoolsup (extra educational support), famsup (family educational support), paid (extra paid classes within the course subject) and absences
# The response is final_grade (numeric: from 0 to 20, output target).
# After the initial exploration, let's split the data into training, validation, test sets
# In this chapter, we will introduce the idea of a validation set, which can be used to select a "best" model from a set of competing models
# In Chapter 1, we demonstrated a simple way to split the data into two pieces using the sample() function
# In this exercise, we will take a slightly different approach that allows us to split the data into more than two parts (here we want three parts: train, validation, test)
# We still use the sample() function, but instead of sampling the indices themselves, we use the sample() function to assign each row to either the training, validation or test sets according to a probability distribution
# The dataset grade is already in your workspace.
grade <- read.csv("./RInputFiles/grade.csv")
# Look/explore the data
str(grade)
## 'data.frame': 395 obs. of 8 variables:
## $ final_grade: num 3 3 5 7.5 5 7.5 5.5 3 9.5 7.5 ...
## $ age : int 18 17 15 15 16 16 16 17 15 15 ...
## $ address : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
## $ studytime : int 2 2 2 3 2 2 2 2 2 2 ...
## $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
## $ famsup : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
## $ paid : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
## $ absences : int 6 4 10 2 4 10 0 6 0 0 ...
# Randomly assign rows to ids (1/2/3 represents train/valid/test)
# This will generate a vector of ids of length equal to the number of rows
# The train/valid/test split will be approximately 70% / 15% / 15%
set.seed(1)
assignment <- sample(1:3, size = nrow(grade), prob = c(0.7, 0.15, 0.15), replace = TRUE)
# Create a train, validation and tests from the original data frame
grade_train <- grade[assignment == 1, ] # subset the grade data frame to training indices only
grade_valid <- grade[assignment == 2, ] # subset the grade data frame to validation indices only
grade_test <- grade[assignment == 3, ] # subset the grade data frame to test indices only
# In this exercise, we will use the grade_train dataset to fit a regression tree using rpart() and visualize it using rpart.plot()
# A regression tree plot will look identical to a classification tree plot, with the exception that there will be numeric values in the leaf nodes instead of predicted classes.
# This is very similar to what we did previously in Chapter 1
# When fitting a classification tree, we should use method = "class", however, when fitting a regression tree, we need to set method = "anova"
# By default, the rpart() function will make an intelligent guess as to what the method value should be based on the data type of your response column,
# but it's recommended that you explicitly set the method for reproducibility reasons (since the auto-guesser may change in the future)
# The grade_train training set is loaded into the workspace
# Train the model
grade_model <- rpart::rpart(formula = final_grade ~ .,
data = grade_train,
method = "anova")
# Look at the model output
print(grade_model)
## n= 282
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 282 1519.49700 5.271277
## 2) absences< 0.5 82 884.18600 4.323171
## 4) paid=no 50 565.50500 3.430000
## 8) famsup=yes 22 226.36360 2.272727 *
## 9) famsup=no 28 286.52680 4.339286 *
## 5) paid=yes 32 216.46880 5.718750
## 10) age>=17.5 10 82.90000 4.100000 *
## 11) age< 17.5 22 95.45455 6.454545 *
## 3) absences>=0.5 200 531.38000 5.660000
## 6) absences>=13.5 42 111.61900 4.904762 *
## 7) absences< 13.5 158 389.43670 5.860759
## 14) schoolsup=yes 23 50.21739 4.847826 *
## 15) schoolsup=no 135 311.60000 6.033333
## 30) studytime< 3.5 127 276.30710 5.940945 *
## 31) studytime>=3.5 8 17.00000 7.500000 *
# Plot the tree model
rpart.plot::rpart.plot(x = grade_model, yesno = 2, type = 0, extra = 0)
# Generate predictions on a test set
pred <- predict(object = grade_model, # model object
newdata = grade_test) # test dataset
# Compute the RMSE
Metrics::rmse(actual = grade_test$final_grade,
predicted = pred)
## [1] 2.278249
# Plot the "CP Table"
rpart::plotcp(grade_model)
# Print the "CP Table"
print(grade_model$cptable)
## CP nsplit rel error xerror xstd
## 1 0.06839852 0 1.0000000 1.0080595 0.09215642
## 2 0.06726713 1 0.9316015 1.0920667 0.09543723
## 3 0.03462630 2 0.8643344 0.9969520 0.08632297
## 4 0.02508343 3 0.8297080 0.9291298 0.08571411
## 5 0.01995676 4 0.8046246 0.9357838 0.08560120
## 6 0.01817661 5 0.7846679 0.9337462 0.08087153
## 7 0.01203879 6 0.7664912 0.9092646 0.07982862
## 8 0.01000000 7 0.7544525 0.9407895 0.08399125
# Retrieve the optimal cp value based on cross-validated error
opt_index <- which.min(grade_model$cptable[, "xerror"])
cp_opt <- grade_model$cptable[opt_index, "CP"]
# Prune the model (to optimized cp value)
grade_model_opt <- rpart::prune(tree = grade_model,
cp = cp_opt)
# Plot the optimized model
rpart.plot::rpart.plot(x = grade_model_opt, yesno = 2, type = 0, extra = 0)
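# A possible follow-up (my own sketch; the course stops at plotting the pruned tree):
# score the pruned model on the test set in the same way as the unpruned model
# pred_opt <- predict(object = grade_model_opt, newdata = grade_test)
# Metrics::rmse(actual = grade_test$final_grade, predicted = pred_opt)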
# Establish a list of possible values for minsplit and maxdepth
minsplit <- seq(1, 4, 1)
maxdepth <- seq(1, 6, 1)
# Create a data frame containing all combinations
hyper_grid <- expand.grid(minsplit = minsplit, maxdepth = maxdepth)
# Check out the grid
head(hyper_grid)
## minsplit maxdepth
## 1 1 1
## 2 2 1
## 3 3 1
## 4 4 1
## 5 1 2
## 6 2 2
# Print the number of grid combinations
nrow(hyper_grid)
## [1] 24
# Number of potential models in the grid
num_models <- nrow(hyper_grid)
# Create an empty list to store models
grade_models <- list()
# Write a loop over the rows of hyper_grid to train the grid of models
for (i in 1:num_models) {
# Get minsplit, maxdepth values at row i
minsplit <- hyper_grid$minsplit[i]
maxdepth <- hyper_grid$maxdepth[i]
# Train a model and store in the list
grade_models[[i]] <- rpart::rpart(formula = final_grade ~ .,
data = grade_train,
method = "anova",
minsplit = minsplit,
maxdepth = maxdepth)
}
# Earlier in the chapter we split the dataset into three parts: training, validation and test
# A dataset that is not used in training is sometimes referred to as a "holdout" set
# A holdout set is used to estimate model performance and although both validation and test sets are considered to be holdout data, there is a key difference:
# Just like a test set, a validation set is used to evaluate the performance of a model
# The difference is that a validation set is specifically used to compare the performance of a group of models with the goal of choosing a "best model" from the group
# All the models in a group are evaluated on the same validation set, and the model with the best performance is considered to be the winner.
# Once you have the best model, a final estimate of performance is computed on the test set.
# A test set should only ever be used to estimate model performance and should not be used in model selection
# Typically if you use a test set more than once, you are probably doing something wrong.
# Number of potential models in the grid
num_models <- length(grade_models)
# Create an empty vector to store RMSE values
rmse_values <- c()
# Write a loop over the models to compute validation RMSE
for (i in 1:num_models) {
# Retrieve the i^th model from the list
model <- grade_models[[i]]
# Generate predictions on grade_valid
pred <- predict(object = model,
newdata = grade_valid)
# Compute validation RMSE and add it to the rmse_values vector
rmse_values[i] <- Metrics::rmse(actual = grade_valid$final_grade,
predicted = pred)
}
# Identify the model with smallest validation set RMSE
best_model <- grade_models[[which.min(rmse_values)]]
# Print the model parameters of the best model
best_model$control
## $minsplit
## [1] 2
##
## $minbucket
## [1] 1
##
## $cp
## [1] 0.01
##
## $maxcompete
## [1] 4
##
## $maxsurrogate
## [1] 5
##
## $usesurrogate
## [1] 2
##
## $surrogatestyle
## [1] 0
##
## $maxdepth
## [1] 1
##
## $xval
## [1] 10
# Compute test set RMSE on best_model
pred <- predict(object = best_model, newdata = grade_test)
Metrics::rmse(actual = grade_test$final_grade, predicted = pred)
## [1] 2.124109
Chapter 3 - Bagged Trees
Introduction to bagged trees:
Evaluating bagged tree performance:
Using caret to cross-validate models:
Example code includes:
# Let's start by training a bagged tree model
# You'll be using the bagging() function from the ipred package
# The number of bagged trees can be specified using the nbagg parameter, but here we will use the default (25)
# If we want to estimate the model's accuracy using the "out-of-bag" (OOB) samples, we can set the coob parameter to TRUE
# The OOB samples are the training observations that were not selected into the bootstrapped sample (used in training)
# Since these observations were not used in training, we can use them instead to evaluate the accuracy of the model (done automatically inside the bagging() function)
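# A small aside (my own note, not from the course): each bootstrap sample of n rows leaves out
# roughly (1 - 1/n)^n of the rows, which approaches exp(-1), about 0.368, so roughly a third of
# the training data is available "out of bag" for each tree
# (1 - 1/800)^800   # ~0.368 for the 800-row credit_train set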
# Bagging is a randomized model, so let's set a seed (123) for reproducibility
set.seed(123)
# Train a bagged model
credit_model <- ipred::bagging(formula = default ~ .,
data = credit_train,
coob = TRUE)
# Print the model
print(credit_model)
##
## Bagging classification trees with 25 bootstrap replications
##
## Call: bagging.data.frame(formula = default ~ ., data = credit_train,
## coob = TRUE)
##
## Out-of-bag estimate of misclassification error: 0.2788
# Generate predicted classes using the model object
class_prediction <- predict(object = credit_model,
newdata = credit_test,
type = "class") # return classification labels
# Print the predicted classes
print(class_prediction)
## [1] no yes yes no no yes no yes no no no yes no yes no no no
## [18] no no no no no no yes no no no yes no yes yes yes no no
## [35] no no no no no no no no no yes no no no yes no yes yes
## [52] no no yes no no no no no no no no no no no yes no no
## [69] no no yes no no yes no no no no no no no no no no no
## [86] no no no no no yes no yes no no no no yes no no no no
## [103] no no yes no no no no no no no no no no no no no no
## [120] no no no yes no no no no no no no no no no no no no
## [137] no no no no yes no yes no yes no no no no no no no yes
## [154] no no no no no no no no yes no no no no yes yes no no
## [171] no no yes yes no no no no no no no yes no no no no yes
## [188] no no no no yes no no no no yes no no yes
## Levels: no yes
# Calculate the confusion matrix for the test set
caret::confusionMatrix(data = class_prediction,
reference = credit_test$default)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 126 36
## yes 12 26
##
## Accuracy : 0.76
## 95% CI : (0.6947, 0.8174)
## No Information Rate : 0.69
## P-Value [Acc > NIR] : 0.0178277
##
## Kappa : 0.3721
## Mcnemar's Test P-Value : 0.0009009
##
## Sensitivity : 0.9130
## Specificity : 0.4194
## Pos Pred Value : 0.7778
## Neg Pred Value : 0.6842
## Prevalence : 0.6900
## Detection Rate : 0.6300
## Detection Prevalence : 0.8100
## Balanced Accuracy : 0.6662
##
## 'Positive' Class : no
##
# In binary classification problems, we can predict numeric values instead of class labels
# In fact, class labels are created only after you use the model to predict a raw, numeric, predicted value for a test point
# The predicted label is generated by applying a threshold to the predicted value, such that all test points with a predicted value greater than the threshold get a predicted label of "1", and points below the threshold get a predicted label of "0".
# In this exercise, generate predicted values (rather than class labels) on the test set and evaluate performance based on AUC (Area Under the ROC Curve)
# The AUC is a common metric for evaluating the discriminatory ability of a binary classification model
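# A minimal sketch of that thresholding step (the 0.5 cutoff is an illustrative choice on my part, not something fixed by the course):
# prob_yes <- predict(credit_model, newdata = credit_test, type = "prob")[, "yes"]
# label <- factor(ifelse(prob_yes > 0.5, "yes", "no"), levels = c("no", "yes"))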
# Generate predictions on the test set
pred <- predict(object = credit_model,
newdata = credit_test,
type = "prob")
# `pred` is a matrix
class(pred)
## [1] "matrix"
# Look at the pred format
head(pred)
## no yes
## [1,] 0.96 0.04
## [2,] 0.28 0.72
## [3,] 0.36 0.64
## [4,] 0.76 0.24
## [5,] 0.92 0.08
## [6,] 0.48 0.52
# Compute the AUC (`actual` must be a binary (or 1/0 numeric) vector)
(credit_ipred_model_test_auc <- Metrics::auc(actual = ifelse(credit_test$default == "yes", 1, 0),
predicted = pred[,"yes"]
))
## [1] 0.7809724
# Use caret::train() with the "treebag" method to train a model and evaluate the model using cross-validated AUC
# The caret package allows the user to easily cross-validate any model across any relevant performance metric
# In this case, we will use 5-fold cross validation and evaluate cross-validated AUC (Area Under the ROC Curve)
# The credit_train dataset is in your workspace. You will use this data frame as the training data
# Specify the training configuration
ctrl <- caret::trainControl(method = "cv", # Cross-validation
number = 5, # 5 folds
classProbs = TRUE, # For AUC
summaryFunction = caret::twoClassSummary) # For AUC
# Cross validate the credit model using "treebag" method;
# Track AUC (Area under the ROC curve)
set.seed(1) # for reproducibility
credit_caret_model <- caret::train(default ~ ., data = credit_train, method = "treebag",
metric = "ROC", trControl = ctrl
)
## Loading required package: lattice
##
## Attaching package: 'lattice'
## The following object is masked from 'package:spatstat':
##
## panel.histogram
# Look at the model object
print(credit_caret_model)
## Bagged CART
##
## 800 samples
## 16 predictor
## 2 classes: 'no', 'yes'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 641, 640, 640, 639, 640
## Resampling results:
##
## ROC Sens Spec
## 0.7203687 0.8275126 0.4417553
# Inspect the contents of the model list
names(credit_caret_model)
## [1] "method" "modelInfo" "modelType" "results"
## [5] "pred" "bestTune" "call" "dots"
## [9] "metric" "control" "finalModel" "preProcess"
## [13] "trainingData" "resample" "resampledCM" "perfNames"
## [17] "maximize" "yLimits" "times" "levels"
## [21] "terms" "coefnames" "contrasts" "xlevels"
# Print the CV AUC
credit_caret_model$results[,"ROC"]
## [1] 0.7203687
# Generate predictions on the test set
pred <- predict(object = credit_caret_model,
newdata = credit_test,
type = "prob")
# Compute the AUC (`actual` must be a binary (or 1/0 numeric) vector)
(credit_caret_model_test_auc <- Metrics::auc(actual = ifelse(credit_test$default == "yes", 1, 0),
predicted = pred[,"yes"]
))
## [1] 0.7762389
# In this exercise, you will print the test set AUC estimates that you computed in previous exercises
# These two methods use the same code underneath, so the estimates should be very similar.
# The credit_ipred_model_test_auc object stores the test set AUC from the model trained using the ipred::bagging() function
# The credit_caret_model_test_auc object stores the test set AUC from the model trained using the caret::train() function with method = "treebag"
# Lastly, we will print the 5-fold cross-validated estimate of AUC that is stored within the credit_caret_model object
# This number will be a more accurate estimate of the true model performance since we have averaged the performance over five models instead of just one
# On small datasets like this one, the difference between test set model performance estimates and cross-validated model performance estimates will tend to be more pronounced
# When using small data, it's recommended to use cross-validated estimates of performance because they are more stable
# Print ipred::bagging test set AUC estimate
print(credit_ipred_model_test_auc)
## [1] 0.7809724
# Print caret "treebag" test set AUC estimate
print(credit_caret_model_test_auc)
## [1] 0.7762389
# Compare to caret 5-fold cross-validated AUC
credit_caret_model$results[, "ROC"]
## [1] 0.7203687
Chapter 4 - Random Forests